Functional Multi-Layer Perceptron: a
Nonlinear Tool for Functional Data Analysis



Abstract 

In this paper, we study a natural extension of Multi-Layer Perceptrons (MLP) 
to functional inputs. We show that fundamental results for classical MLP can be 
extended to functional MLP. We obtain universal approximation results that show 
the expressive power of functional MLP is comparable to that of numerical MLP. We 
obtain consistency results which imply that the estimation of optimal parameters 
for functional MLP is statistically well defined. We finally show on simulated and 
real world data that the proposed model performs in a very satisfactory way. 

Key words: Functional data analysis, Multi-Layer Perceptron, Universal 
approximation, Supervised learning, Curves discrimination, Learning consistency,
Nonlinear functional model, Spectrometric data

1 Introduction



Functional Data Analysis (FDA, see Ramsay and Silverman (1997) for a
comprehensive introduction to FDA methods) is an extension of traditional data
analysis to functional data. In this framework, each individual is characterized
by one or more real valued functions, rather than by a vector of $\mathbb{R}^n$.
An important feature of FDA is its ability to take into account dependencies between
numerical measurements that describe an individual, especially smoothness. 
If we represent for instance the size of a child at different ages by a vector, 
traditional methods generally consider each value to be independent of the 
others. In FDA, the size is represented as a function (in general a regular one) 
that maps measurement times to centimeters. 



In order to deal with irregular measurements and to allow numerical manipula- 
tion of functions, FDA replaces actual observations by a simple functional rep- 
resentation. Spline based approximation is the most commonly used method, 
as it represents each individual by a smooth function. Kernel or wavelet based 
approximations are also used. FDA has been successfully applied to real
problems such as climatic variation forecasting (Besse et al. (2000)),
acidification process studying (Abraham et al. (2003)), analysis of children
size evolution (Ramsay and Silverman (1997)), land usage prediction based on
satellite images (Besse et al. (2004)), etc.



In this paper, we focus on a precise yet very general task: we assume that we 
observe functions associated to a classical target variable. This variable can be 
for instance a class label, in which case we perform supervised classification. 
If the variable is a real valued vector, we perform a regression. The key idea is 
that, whereas individuals are described thanks to functions, we still want to 
predict a traditional numerical value. In mathematical terms, we have $n$
examples described by $s + 1$ variables,
$(g_1^i, \ldots, g_s^i, t^i)_{i \in \{1, \ldots, n\}}$, where $t^i$ is the target
variable (with $t^i \in \mathbb{R}^o$) and where each $g_l^i$ is a function
belonging to a given functional space. The problem is to predict $t^i$ based on
$(g_1^i, \ldots, g_s^i)$. In the framework of FDA, several methods have been
proposed to solve this kind of problem, for instance the linear functional
model (see e.g. Hastie and Mallows (1993), Marx and Eilers (1996), Ramsay and
Silverman (1997), Cardot et al. (1999), Cardot et al. (2003) and James
(2002)), functional discriminant analysis (e.g. James and Hastie (2001)),
functional Slice Inverse Regression (see Li (1991) for the classical SIR and
Ferré and Yao (2003) for its functional version) and non-parametric kernel
based functional estimators (see Ferraty and Vieu (2002), Ferraty and Vieu
(2003) and Ferraty et al. (2002)).



In this paper, we show how Multi-Layer Perceptrons (MLP) can be directly 
applied to functional data, so as to provide nonlinear semi-parametric function 
classification and regression. We introduce a major difference with traditional
FDA methods: our model works directly with the studied functions, without
using a simplified representation. This avoids restrictions on the functional 
weight representation which can therefore be adapted to the context. For in- 
stance, functional data with low dimensional input spaces can be manipulated 
thanks to generalized linear models (such as splines), whereas MLP are used 
for functions with high dimensional input spaces. 



When functional data are perfectly known, the extension of MLP we propose is
a particular case of an extension proposed and studied from a purely theoretical
point of view in Stinchcombe (1999). In Stinchcombe (1999), the author shows
that traditional universal approximation results for MLP can be extended to 
(almost) arbitrary input spaces, including infinite dimensional vectorial spaces. 
These results rely on the approximation of continuous linear forms defined on 
the MLP input space. In our work, we show how to carry out this kind of 
approximation in practice, for instance by using traditional MLP. We show 
this way that functional MLP are universal approximators and therefore that 
they can be used to model complex dependencies between a real valued target 
variable and functional inputs. 



Moreover, we show that training a parametric functional MLP on a finite 
number of function examples is statistically valid, as the optimal parame- 
ters obtained thanks to those examples provide a consistent estimation of 
asymptotic optimal parameters, even if we assume limited knowledge on each 
function example (i.e., each function is only known thanks to a finite
number of (input, output) pairs). This is a direct translation of classical
results, presented in White (1989) for instance, available for numerical MLP.



The rest of the paper is organized as follows. In the first part, we assume that 
we have perfect knowledge of manipulated functions: we start by introducing 
in section 2 the proposed functional MLP model. Then we show in section
3 how the results of Stinchcombe (1999) can be adapted to functional MLP
to show they are universal approximators. In the second part, we take into
account sampling: consistency of functional MLP training is studied in section
4. Section 5 compares our approach to alternative neural solutions, from a
theoretical point of view. Then, section 6 gives some experimental results both
on simulated and real world data. Proofs are presented in section 8.

2 Functional Multi-Layer Perceptrons



2.1 Functional data 



As stated in the introduction, an observation is described by $s + 1$ values,
$(g_1, \ldots, g_s, t)$, where each $g_l$ is a function (and $t \in \mathbb{R}^o$).
More precisely, we assume that $\mu_l$ is a $\sigma$-finite positive Borel
measure defined on $\mathbb{R}^{u_l}$ and that $g_l$ belongs to $L^{p_l}(\mu_l)$.



2.2 Functional neurons 



The extension of numerical neurons to functional inputs is straightforward. 
Indeed, an $n$-input MLP neuron is characterized by a fixed activation function,
$T$, a function from $\mathbb{R}$ to $\mathbb{R}$, by a vector from $\mathbb{R}^n$
(the weight vector, $w$) and by a real valued threshold, $b$. Given a vectorial
input $x \in \mathbb{R}^n$, the output of the neuron is $N(x) = T(w \cdot x + b)$.

This formula is based on the linear form $x \mapsto w \cdot x$. When
$x = (g_1, \ldots, g_s) \in L^{p_1}(\mu_1) \times \ldots \times L^{p_s}(\mu_s)$,
a linear form can be constructed thanks to integrals, for instance:

$$(g_1, \ldots, g_s) \mapsto \sum_{l=1}^{s} \int f_l g_l \, d\mu_l, \tag{1}$$

where $(f_1, \ldots, f_s)$ are measurable functions chosen such that
$f_l g_l \in L^1(\mu_l)$. Using this linear form, we can define a functional neuron:

Definition 1 A functional neuron on $E = L^{p_1}(\mu_1) \times \ldots \times L^{p_s}(\mu_s)$
is defined thanks to a fixed activation function $T$ from $\mathbb{R}$ to
$\mathbb{R}$, weight functions $f_l$ (such that $f_l g_l \in L^1(\mu_l)$) and a
real valued threshold, $b$. It calculates

$$N(g_1, \ldots, g_s) = T\left(b + \sum_{l=1}^{s} \int f_l g_l \, d\mu_l\right). \tag{2}$$



This functional neuron is a special case of general neurons proposed in
Sandberg (1996); Sandberg and Xu (1996); Stinchcombe (1999). The main
drawback of this model is that it uses functional weights rather than
numerical ones. This problem can be solved by using parametric representation of
functions. More precisely, we assume given $s$ functions $F_1, \ldots, F_s$ such
that (hypothesis $H_a$):

(1) $W_l \subset \mathbb{R}^{v_l}$

(2) $F_l$ is a function from $W_l \times \mathbb{R}^{u_l}$ to $\mathbb{R}$

(3) for each $w_l \in W_l$, $F_l(w_l, .) \in L^{q_l}(\mu_l)$ where $q_l$ is the
conjugate exponent associated to $p_l$

For instance, $F_l$ can be implemented thanks to a numerical MLP (in this
case, $w_l$ is the weight vector of the MLP) or thanks to the first functions of
a topological basis of $L^{q_l}(\mu_l)$ (in this case, we have
$F_l(w_l, x) = \sum_{i=1}^{v_l} w_{l,i} \, \psi_i(x)$, where
$(\psi_i)_{i \in \mathbb{N}}$ is the considered topological basis).

We can now introduce the definition of a parametric functional neuron: 

Definition 2 A parametric functional neuron on
$E = L^{p_1}(\mu_1) \times \ldots \times L^{p_s}(\mu_s)$ is defined thanks to a
fixed activation function $T$ from $\mathbb{R}$ to $\mathbb{R}$, a weight vector
$w \in W_1 \times \ldots \times W_s$ and a real valued threshold, $b$. It calculates

$$N(g_1, \ldots, g_s) = T\left(b + \sum_{l=1}^{s} \int F_l(w_l, x) \, g_l(x) \, d\mu_l(x)\right). \tag{3}$$

2.3 Functional MLP

As a functional neuron gives a real output, we have to use numerical neurons
except in the first layer of a functional MLP. In particular, a one hidden
layer parametric functional perceptron with one functional input and one real
output computes a function of the following form:

$$H(g) = \sum_{i=1}^{k} a_i \, T\left(b_i + \int F_i(w_i, x) \, g(x) \, d\mu(x)\right), \tag{4}$$

where the $a_i$ are real numbers, as well as the $b_i$, and the $w_i$ are
parameter vectors for the $F_i$.

Those definitions extend readily to more than one output and/or hidden layer.
The only difference between a functional $n$-hidden layer perceptron and a
numerical one is that, as stated above, we use functional neurons only in the
first layer. A general functional MLP is defined in the same way, by using
functional neurons rather than parametric functional neurons.
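
As an illustration only, the one hidden layer model of equation (4) can be sketched numerically. The sketch below assumes a single functional input defined on [0, 1] with the Lebesgue measure as $\mu$, replaces the integral by a trapezoidal rule on a regular grid, and uses a polynomial basis expansion for each $F_i$. The names `weight_function`, `integrate` and `functional_mlp` are hypothetical and are not taken from the paper.

```python
import math

def weight_function(w, x):
    """Parametric weight function F(w, x) = sum_i w[i] * x**i (polynomial basis)."""
    return sum(w_i * x ** i for i, w_i in enumerate(w))

def integrate(h, n_points=201):
    """Trapezoidal approximation of the integral of h over [0, 1] (an assumption)."""
    step = 1.0 / (n_points - 1)
    values = [h(j * step) for j in range(n_points)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def functional_mlp(g, a, b, w, activation=math.tanh):
    """Equation (4): H(g) = sum_i a[i] * T(b[i] + integral of F(w[i], x) g(x))."""
    output = 0.0
    for a_i, b_i, w_i in zip(a, b, w):
        inner = integrate(lambda x: weight_function(w_i, x) * g(x))
        output += a_i * activation(b_i + inner)
    return output

# Two hidden functional neurons applied to the input function g(x) = x.
a = [1.0, -0.5]
b = [0.0, 0.1]
w = [[1.0], [0.0, 1.0]]  # F_1(w_1, x) = 1 and F_2(w_2, x) = x
print(functional_mlp(lambda x: x, a, b, w))
```

Only the first layer touches the input function; everything after the integrals is an ordinary numerical computation, which is why deeper architectures need no further change.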



3 Universal approximation 

3.1 Definitions and notations 



We use notations and definitions from Stinchcombe (1999).

3.1.1 Functional spaces and metrics

We denote $C(A, B)$ the set of continuous functions from $A$ to $B$, where $A$
and $B$ are two topological spaces. As a special case, $C^n$ is the set of
continuous functions from $\mathbb{R}^n$ to $\mathbb{R}$. $M^n$ is the set of
(Borel) measurable functions from $\mathbb{R}^n$ to $\mathbb{R}$. We denote $d_C$
the metric on $M^n$ that gives uniform convergence over compact subsets:

$$d_C(f, g) = \sum_{k=1}^{\infty} 2^{-k} \min\left\{\sup_{\|x\| \le k} |f(x) - g(x)|, 1\right\}. \tag{5}$$

When $K$ is a compact subset of $X$, a topological space, we define $\rho_K$, a
metric on the set of functions from $K$ to $\mathbb{R}$, by:

$$\rho_K(f, g) = \sup_{x \in K} |f(x) - g(x)|. \tag{6}$$



Definition 3 Let $X$ be a metric space with $d$ the associated metric. Let $C$
and $S$ be two subsets of $X$. $S$ is $d$-outside dense in $C$ if the $d$-closure
of $S$ contains $C$, and $S$ is $d$-inside dense in $C$ if the $d$-closure of
$S \cap C$ contains $C$.

When $C = X$, $d$-inside density is equivalent to $d$-outside density and is
simply called $d$-density.



3.1.2 One hidden layer perceptrons 

Definition 4 If $T$ is a function from $\mathbb{R}$ to $\mathbb{R}$ and $n$ a
positive integer, $S_T^n$ is the set of functions exactly computed by one hidden
layer perceptrons with $n$ inputs and one output, and using $T$ as activation
function, i.e. the set of functions of the form
$h(x) = \sum_{i=1}^{p} \beta_i T(w_i \cdot x + b_i)$ where $p \in \mathbb{N}$,
$\beta_i \in \mathbb{R}$, and $(w_i, b_i) \in \mathbb{R}^{n+1}$.

Definition 5 If $X$ is a topological vector space, $A$ a subset of $X^*$ and $T$
a function from $\mathbb{R}$ to $\mathbb{R}$, $S_T^X(A)$ is the set of functions
exactly computed by one hidden layer generalized perceptrons with input in $X$,
one real output, and weight forms in $A$, i.e. the set of functions from $X$ to
$\mathbb{R}$ of the form $h(x) = \sum_{i=1}^{p} \beta_i T(l_i(x) + b_i)$ where
$p \in \mathbb{N}$, $\beta_i \in \mathbb{R}$, $b_i \in \mathbb{R}$ and $l_i \in A$.

Note that $A$ can in fact be any set of functions from $X$ to $\mathbb{R}$, in
which case we do not introduce constant terms $b_i$.

According to this definition, functional one hidden layer perceptrons are a
special case of Stinchcombe generalized perceptrons in which $X$ is a product of
$L^p$ spaces and $A$ is given by linear forms of the form
$l(g_1, \ldots, g_s) = \sum_{i=1}^{s} \int f_i g_i \, d\mu_i$ (or
$l(g_1, \ldots, g_s) = \sum_{i=1}^{s} \int F_i(w_i, x) \, g_i(x) \, d\mu_i(x)$ for
parametric functional perceptrons).



3.2 Universal approximation with functional MLP 



Several approximation results show that $S_T^X(A)$ is inside or outside dense
in different functional spaces. Indeed Stinchcombe (1999) (as well as
Sandberg and Xu (1996) and Chen (1998)) proposes approximation results
for $S_T^X(A)$ for almost arbitrary spaces $X$ (see theorem 5.1 and corollaries
5.1.2 and 5.1.3 from Stinchcombe (1999)). In order to apply those general results
to practical cases, complex technical properties have to be satisfied by $A$. In this
section, we show that those properties are satisfied by very general functional 
one hidden layer perceptrons. 

Corollary 6 Let $\mu$ be a finite positive Borel measure on $\mathbb{R}^n$. Let
$1 < p \le \infty$ be an arbitrary real number and $q$ be the conjugate exponent
of $p$. Let $V$ be a dense subset of $L^q(\mu)$. Let $A_V$ be the set of linear
forms on $L^p(\mu)$ of the form $l(f) = \int f g \, d\mu$, where $g \in V$. Let
$T$ be a measurable function from $\mathbb{R}$ to $\mathbb{R}$ such that $S_T^1$
is $d_C$-inside (resp. $d_C$-outside) dense in $C^1$. Then
$S_T^{L^p(\mu)}(A_V)$ is $\rho_K$-inside (resp. $\rho_K$-outside) dense in
$C(K, \mathbb{R})$, where $K$ is any compact subset of $L^p(\mu)$.

Corollary 7 Let $\mu$ be a finite positive compactly supported Borel measure on
$\mathbb{R}^n$. Let $T$ be a measurable function from $\mathbb{R}$ to
$\mathbb{R}$, such that $S_T^1$ is $d_C$-inside (resp. $d_C$-outside) dense in
$C^1$. Let $V$ be a subset of $L^\infty(\mu)$ $d_C$-inside (or $d_C$-outside)
dense in $C^n$. Then $S_T^{L^1(\mu)}(A_V)$ is $\rho_K$-outside dense in
$C(K, \mathbb{R})$, where $K$ is any compact subset of $L^1(\mu)$.



3.3 Discussion 



Corollary 6 shows that as long as we can approximate functions in $L^q(\mu)$
and in $C^1$, then a one hidden layer perceptron can be used to approximate
functions in $C(K, \mathbb{R})$, where $K$ is a compact subset of $L^p(\mu)$.
Previous works give very weak conditions on $T$ that imply $d_C$-inside or
outside density of $S_T^1$ for $C^1$, see for instance Theorem 1 in Leshno et
al. (1993) and Theorem 1 in Hornik (1993). Basically, $T$ must be non polynomial
and Riemann integrable on a non-degenerate compact interval of $\mathbb{R}$,
properties that are obviously satisfied by popular activation functions such
as tanh.



The generalized MLP used in corollary 6 uses linear forms in $A_V$ and is
therefore a functional MLP with weight functions chosen in $V$, a dense subset
of $L^q(\mu)$. In practical situations, weight functions are represented thanks
to parametric functions ($F(w, .)$). This constraint does not introduce any
problem, as long as we choose a parametric universal approximator for
$L^q(\mu)$. Thanks to Theorem 1 of Hornik (1991), we can use for instance one
hidden layer perceptrons based on activation function $U$ (i.e., $V = S_U^n$) as
long as $U$ is measurable, bounded and non constant (as $p > 1$, $q < \infty$
and Theorem 1 applies). Other models can be used (B-spline, wavelet, Fourier
series, etc.) but imply in general additional restrictions on the considered
functional space.



The proof of corollary 6 could be extended to $p = 1$, and therefore, one
might wonder why corollary 7 is useful. As pointed out in the introduction
of Stinchcombe (1999), no $S_T^n$ set is dense in $L^\infty(\mu)$. Therefore,
the main assumption of corollary 6 ($V$ is dense in $L^q(\mu)$) cannot be
satisfied by MLP based approximation. This reduces greatly the interest of
corollary 6 for $p = 1$. That is why corollary 7 is useful: as shown for
instance by Theorem 1 of Hornik (1993), $S_U^n$ can be used to provide
approximation to continuous functions on a compact set. Therefore, the
situation for $p = 1$ is quite similar to the one that stands for $p > 1$,
except that the measure has to be compactly supported.



This means that when $K$ is a compact subset of an $L^p(\mu)$ functional space,
any function from $C(K, \mathbb{R})$ can be approximated to a given precision level
by a functional MLP that uses a finite number of parameters (because linear 
forms can be represented for instance thanks to numerical MLPs). Despite the 
radical change in the input space dimension (from R n to a compact subset of 
a functional space), we can still effectively approximate continuous functions. 



It is very common in FDA to assume that studied functions are smooth,
that is at least continuous. If we only consider compact input spaces for those
functions, their case is covered by corollary 6. Indeed, continuous functions (or
more regular functions) on a compact subset $Z$ of $\mathbb{R}^n$ are obviously
elements of $L^\infty(\lambda)$ where $\lambda$ is the restriction of the
Lebesgue measure to $Z$. Moreover a compact subset $K$ of a space of regular
functions (considered with the uniform norm) is a compact subset of
$L^\infty(\lambda)$. This means that any continuous function from $K$ to
$\mathbb{R}$ can be approximated by a functional MLP as long as $L^1(\lambda)$
can also be approximated (this can be done thanks to $S_U^n$, see Hornik (1991)).



Extension of proposed corollaries to multiple functional inputs is
straightforward. In fact, corollaries are based on approximation of linear forms
on $X$, the input space of extended neurons. When
$X = L^{p_1}(\mu_1) \times \ldots \times L^{p_r}(\mu_r)$, approximation of
elements of $X^*$ is obtained thanks to approximations of elements of
$(L^{p_i}(\mu_i))^*$, because a linear form on $X$ is a linear combination of
linear forms on $L^{p_i}(\mu_i)$ (this fact was used to define the functional
neuron).

4 Consistency of Functional MLP learning

4.1 Introduction

As explained in the introduction, our goal is to explain a target variable
$t \in \mathbb{R}^o$ thanks to functional observations $(g_1, \ldots, g_s)$.
Basically, we assume that there is a functional relationship such that
$t \simeq F(g_1, \ldots, g_s)$ and we try to model $F$ thanks to a functional
MLP. Thanks to universal approximation results given in the previous section,
we know that any regular $F$ can be approximated by a functional MLP.
Nevertheless, an important problem remains: $F$ is obviously unknown and a
correct approximation has to be constructed thanks to a limited number of
examples of this mapping.

4.2 Probabilistic framework

4.2.1 Functional data

Let us now describe the probabilistic framework of our problem. All random
quantities will be defined on a given probability space $(\Omega, \mathcal{A}, P)$.
For the sake of simplicity, we consider only the case of a unique functional
input. More precisely, we make the following hypotheses ($H_b$):

(1) $Z$ is a compact subset of $\mathbb{R}^u$

(2) $(G^i, T^i)_{i \in \mathbb{N}}$ is an i.i.d. sequence of random elements with
values in $C(Z, \mathbb{R}) \times \mathbb{R}^o$ (i.e., each $G^i$ is a
measurable function from $\Omega$ to $C(Z, \mathbb{R})$ considered with its
Borel sigma algebra and each $T^i$ is a random vector in $\mathbb{R}^o$, and
the sequence is i.i.d.)

The hypotheses on the observed functions are quite different from those of
corollaries 6 and 7: on the one hand, $H_b$ are stronger than the hypotheses of
the corollaries as they consider only continuous functions defined on a compact
set; on the other hand, they are weaker as observed functions do not have to
belong to a compact subset of $C(Z, \mathbb{R})$.

4.2.2 Parametric model

We try to model the relationship between $G^i$ and $T^i$ thanks to a special
kind of parametric model (a parametric functional MLP) that has the following
form:

$$H(w, g) = U\left(w_0, \int F_1(w_1, x) \, g(x) \, d\mu(x), \ldots, \int F_k(w_k, x) \, g(x) \, d\mu(x)\right), \tag{7}$$

where $w = (w_0, w_1, \ldots, w_k) \in W = W_0 \times W_1 \times \ldots \times W_k$,
the $F_l$ are parametric models as in parametric neurons, $U$ is a regular
function and $\mu$ a finite positive Borel measure (defined on $Z$). This
parametric form is quite similar to the one proposed in the context of Slice
Inverse Regression by Ferré and Yao (2003).



Our main motivation here is to use a general form that covers functional
multi-layer perceptrons without making too many hypotheses on their architecture
(number of layers, activation functions, linear terms, etc.). For instance, if
$U$ is defined as follows:

$$U(w_0, o_1, \ldots, o_k) = \sum_{i=1}^{k} a_i \, T(b_i + o_i), \tag{8}$$

with $w_0 = (a_1, b_1, \ldots, a_k, b_k)$, then $H(w, g)$ is exactly the output of
a functional one hidden layer perceptron, as given by equation 4. As a side
effect, we cover any model that uses integrals to transform an input function
into a real number.

Some restrictions are needed on the $F_l$ functions and on $U$ (hypothesis $H_c$):

(1) for $0 \le l \le k$, $W_l$ is a compact subset of $\mathbb{R}^{v_l}$

(2) for $1 \le l \le k$, $F_l$ is a function from $W_l \times Z$ to $\mathbb{R}$
such that:

(a) for each $x \in Z$, $F_l(., x)$ is continuous

(b) for each $w_l \in W_l$, $F_l(w_l, .)$ is measurable

(c) $F_l$ is dominated on $W_l$, i.e., there is a measurable function
$d_l \in L^p(\mu)$ (with $p > 1$) such that for all $w \in W_l$ and $x \in Z$,
$|F_l(w, x)| \le d_l(x)$.

(3) $U$ is a uniformly continuous function from $W_0 \times \mathbb{R}^k$ to
$\mathbb{R}^o$

(4) $U$ is bounded

Hypotheses $H_c$ are quite natural and are fulfilled in practical settings:

• Compactness of the parameter space is a classical hypothesis in consistency
results.

• Useful choices for $F_l$ are numerical MLP and basis expansions: for the
former, continuity is mandatory in practice as optimal parameters are obtained
thanks to gradient based algorithms (and therefore $F_l$ is in general
differentiable with respect to $w_l$); for the latter, continuity is obvious as
$F_l$ is linear with respect to $w_l$.

• As stated before, when $F_l$ is obtained thanks to a numerical MLP, it is a
continuous function. As $W_l$ and $Z$ are compact sets, the domination
hypothesis is automatically fulfilled. When $F_l$ is obtained thanks to basis
expansion, a natural hypothesis is to assume that basis functions belong to
$L^p(\mu)$. Then, compactness of $W_l$ implies again that the domination
hypothesis is fulfilled.

• $U$ corresponds to the non functional part of a functional MLP, and it is in
general natural to assume that it is uniformly continuous. Indeed, popular
activation functions such as tanh and the logistic function are uniformly
continuous and moreover, $W_0$ is compact; therefore when $U$ represents an MLP
based on standard activation functions, it is uniformly continuous. Moreover,
popular activation functions are also bounded and the assumption that $U$ is
bounded is also natural.



4.2.3 Optimal model and consistency

The learning phase in neural network applications consists in finding the best
parameters for a given task. In our framework, we assume given a distance $c$
on $\mathbb{R}^o$ and we assess the quality of the neural model at the
evaluation point $G^i$ thanks to $c(T^i, H(G^i, w))$. We define the global error
made by the parametric model $H$ for parameters $w \in W$ by:

$$\lambda(w) = E\left(c(T^i, H(G^i, w))\right), \tag{9}$$

where $E$ means expectation. Learning is in fact a parameter estimation
problem in which we try to minimize $\lambda(w)$ in order to find a vector
$w \in W^*$, where $W^* \subset W$ is the set of minimizers of $\lambda(w)$. The
practical problem is that $\lambda(w)$ cannot be exactly calculated and is
approximated thanks to a finite number of realizations of $(G^i, T^i)$. More
precisely, we define an empirical error by:

$$\lambda_n(w) = \frac{1}{n} \sum_{i=1}^{n} c(T^i, H(G^i, w)). \tag{10}$$

This empirical error can be minimized to produce $w_n$, an estimation of an
optimal parameter vector. White (1989) shows that for numerical MLP, $w_n$ is a
strongly consistent estimation of an optimal parameter vector. More precisely,
if $d$ denotes the distance on $W$, then
$\lim_{n \to \infty} d(w_n, W^*) = 0$ almost surely. Among technical hypotheses
needed to ensure this result, we adapt a domination hypothesis to the
functional framework (hypothesis $H_d$): $c(T^i, H(G^i, w))$ has to be
dominated, in the sense that there is a positive function $c_{\max}$ from
$\mathbb{R}^o$ to $\mathbb{R}$ such that:

(1) $\forall w \in W$, $g \in C(Z, \mathbb{R})$ and $t \in \mathbb{R}^o$,
$c(t, H(g, w)) \le c_{\max}(t)$

(2) $E(c_{\max}(T^i)) < \infty$

For functional MLP, hypotheses $H_d$ are quite natural. Indeed, hypothesis $H_c$
(4) makes $H(g, w)$ bounded and therefore domination turns into a hypothesis
on $T^i$ and $c$. For instance if $c$ is the Euclidean distance in
$\mathbb{R}^o$, then domination is obtained if $T^i$ has a second order moment.

Note that $c$ does not really have to be a distance: it can be any continuous
positive function.

Compared to the numerical case, we have two additional difficulties in the
functional framework: we are working with random elements with values in a
functional space, whereas White (1989) assumes that observations belong to a
finite dimensional space; moreover, perfect knowledge of observed functions is
seldom the case and we have to take into account that functions are measured
at a finite number of observation points.



4.2.4 Function observations

In practical situations, each observed function is described by a finite number
of input/output pairs, such as $(x_j, g(x_j))_{j \in \{1, \ldots, m\}}$. We choose
the following mathematical model (hypothesis $H_e$):

(1) $(X_j^i)_{i \in \mathbb{N}, j \in \mathbb{N}}$ is a sequence of independent
sequences of random variables defined on $(\Omega, \mathcal{A}, P)$ and with
values in $Z$.

(2) All $X_j^i$ are identically distributed and the induced probability measure
on $Z$ is $\mu = P_X$.

(3) $(\mathcal{E}_j^i)_{i \in \mathbb{N}, j \in \mathbb{N}}$ is a sequence of
independent sequences of random variables defined on $(\Omega, \mathcal{A}, P)$
and with values in $\mathbb{R}$.

(4) For all $i$, $(\mathcal{E}_j^i)_{j \in \mathbb{N}}$ and
$(X_j^i)_{j \in \mathbb{N}}$ are independent.

(5) $E(\mathcal{E}_j^i) = 0$ and $E(|\mathcal{E}_j^i|^q) < \infty$, where $q$ is
the conjugate exponent to $p$ used in hypothesis $H_c$ (2-c).



For each $i$, the sequence $(X_j^i)_{j \in \mathbb{N}}$ corresponds to
observation points for the function $G^i$ and the sequence
$(\mathcal{E}_j^i)_{j \in \mathbb{N}}$ corresponds to measurement errors for
these observation points. More precisely, if $g^i$, $x_j^i$ and $\epsilon_j^i$
are respectively realizations of $G^i$, $X_j^i$ and $\mathcal{E}_j^i$, we assume
that we observe the sequence $y_j^i = g^i(x_j^i) + \epsilon_j^i$. Moreover, we
assume that we know only the $m^i$ first values of this sequence.



Hypotheses $H_e$ are natural in this framework, especially independence. The
main hypothesis is $H_e$ (2), which says that the way observation points are
randomly chosen (i.e., $P_X$) corresponds to the way integrals are calculated
($\mu$). From an intuitive point of view, this means that when an input function
is matched to functional weights thanks to integral calculation, probable
observation points have more weight than less probable ones. This is quite
natural.



As functions are only known thanks to observations, we can no longer compute
the integrals, which are approximated thanks to empirical means. More
precisely, we replace $\int F_l(w_l, x) \, g^i(x) \, d\mu(x)$ by:

$$\frac{1}{m^i} \sum_{j=1}^{m^i} F_l(w_l, x_j^i) \, y_j^i. \tag{11}$$

Therefore, the empirical error $\lambda_n(w)$ given in equation 10 is
approximated by the following empirical error:

$$\lambda_n^m(w) = \frac{1}{n} \sum_{i=1}^{n} c\left(t^i, U\left(w_0, \frac{1}{m^i} \sum_{j=1}^{m^i} F_1(w_1, x_j^i) \, y_j^i, \ldots, \frac{1}{m^i} \sum_{j=1}^{m^i} F_k(w_k, x_j^i) \, y_j^i\right)\right), \tag{12}$$

where $t^i$ is a realization of $T^i$ and $m = \inf_{1 \le i \le n} m^i$.

This empirical error, which is based on a finite number of numerical values,
is easy to evaluate in practice and can be used to obtain empirical optimal
parameters, $w_n^m$. Our goal is to show that $w_n^m$ is a consistent estimator
of an optimal parameter vector, i.e. converges to $W^*$.
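
The empirical mean that replaces each integral, and the resulting empirical error, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $Z = [0, 1]$, $\mu$ is the uniform distribution (so observation points are drawn uniformly, as hypothesis $H_e$ (2) requires), the noise is Gaussian, the target is scalar, the cost $c$ is the squared error, $U$ is the one hidden layer form of equation 8, and each $F_l$ is a polynomial. All function names are hypothetical, and every available observation point is used rather than a common number $m$.

```python
import math
import random

def weight_function(w, x):
    """Parametric weight function F(w, x) = sum_i w[i] * x**i."""
    return sum(w_i * x ** i for i, w_i in enumerate(w))

def empirical_integral(w, xs, ys):
    """Empirical mean (1/m) sum_j F(w, x_j) y_j, estimating the integral of F(w, .) g."""
    return sum(weight_function(w, x) * y for x, y in zip(xs, ys)) / len(xs)

def model_output(params, xs, ys):
    """One hidden layer functional MLP fed with the empirical integrals."""
    return sum(a * math.tanh(b + empirical_integral(w, xs, ys))
               for a, b, w in params)

def empirical_error(params, samples):
    """Empirical error over n sampled functions, with squared error as cost c."""
    total = 0.0
    for xs, ys, t in samples:
        total += (t - model_output(params, xs, ys)) ** 2
    return total / len(samples)

def observe(g, m, noise_std, rng):
    """Draw m observation points x_j ~ U[0, 1] and noisy values y_j = g(x_j) + e_j."""
    xs = [rng.random() for _ in range(m)]
    ys = [g(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

rng = random.Random(0)
xs, ys = observe(lambda x: x, m=20000, noise_std=0.1, rng=rng)
# For g(x) = x and F(w, x) = x, the exact integral is 1/3.
estimate = empirical_integral([0.0, 1.0], xs, ys)
print(abs(estimate - 1.0 / 3.0))
```

Because the noise is centered and independent of the observation points, it averages out in the empirical mean, which is the mechanism the consistency result relies on.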



4.3 Consistency

Consistency of the proposed estimation of optimal parameters is given by the
following theorem:

Theorem 8 Under hypotheses $H_b$, $H_c$, $H_d$ and $H_e$, we have $P$-almost
surely:

$$\lim_{n \to \infty} \lim_{m \to \infty} d(w_n^m, W^*) = 0.$$

The theorem is an extension of White (1989). It suffers from a small
limitation: the limit is a sequential one, which means that in order to reach
a given distance to $W^*$, the number of evaluation points for each function
($m$) depends on the number of functions ($n$).



5 Alternative methods for functional inputs 



5.1 Functions observed at identical points 



In some particular cases, functions are all observed thanks to a unique
sequence of observation points, that is there is a sequence
$(x_j)_{j \in \mathbb{N}}$ such that for any considered function $g$, we know
$g(x_j)$ for all $j$. Moreover, we assume that we use the same number of
observation points for each function (denoted by $m$). These cases include for
instance situations in which measurement points are under user control (e.g.,
spectroscopic measurements corresponding to specific frequencies). Of course,
this case is covered by theorem 8.

From a practical point of view, the situation is clearly simpler than the
general one. Indeed each function $g$ can be considered as a vector in
$\mathbb{R}^m$, i.e., $(g(x_1), \ldots, g(x_m))$. Therefore, we can submit these
multivariate observations to a numerical MLP. This approach was proposed in
Chen and Chen (1995). Let us consider the special case of a single hidden layer
perceptron with one real output. Such an MLP maps a function $g$ to:

$$V(g) = \sum_{i=1}^{k} a_i \, T\left(b_i + \sum_{j=1}^{m} c_{ij} \, g(x_j)\right). \tag{13}$$

In such a setting, our model maps $g$ to:

$$H(g) = \sum_{i=1}^{k} a_i \, T\left(b_i + \frac{1}{m} \sum_{j=1}^{m} F_i(w_i, x_j) \, g(x_j)\right). \tag{14}$$

From a practical point of view, the main advantage of our approach over the
numerical one in this setting is the increased flexibility induced by the use of
the parametric functions $F_i$. We can for instance take into account smoothness
of observed functions by using simple parametric functions (i.e., MLP with a
small number of hidden nodes, B-splines with just a few nodes, etc.). This
makes it possible to reduce the number of free parameters in the model while
incorporating expert knowledge into it, whereas in the numerical approach, we
need in each neuron one connection weight for each function observation point.
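
To make the parameter saving concrete, the following sketch counts the free parameters of models (13) and (14) for one hidden layer with $k$ neurons. It assumes that every $F_i$ has the same number $v$ of numerical parameters; the function names and the figures used ($m = 100$ observation points, $v = 5$) are illustrative only.

```python
def numerical_mlp_parameters(k, m):
    """Model (13): each hidden neuron has m weights c_ij, one bias b_i, one output weight a_i."""
    return k * (m + 2)

def functional_mlp_parameters(k, v):
    """Model (14): each hidden neuron has v parameters in w_i, one bias b_i, one output weight a_i."""
    return k * (v + 2)

k, m, v = 10, 100, 5
print(numerical_mlp_parameters(k, m))   # 1020
print(functional_mlp_parameters(k, v))  # 70
```

The count for the functional model does not depend on $m$, so observing the functions on a finer grid leaves the model size unchanged.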

Moreover, an appropriate choice of parametric functions $F_i$ makes it possible
to reproduce exactly the numerical model, which appears this way as a special
case of the functional approach. Indeed each $F_i$ can be an interpolation
spline or a kernel based model designed such that for any set of weights
$c_{ij}$, there are weight vectors $w_i$ such that $F_i(w_i, x_j) = c_{ij}$.

Finally, the universal approximation result given in Chen and Chen (1995) is
less general than ours as it relies on uniform sampling.

For all those reasons, we believe that the functional approach is more
interesting than the multivariate approach, even for uniformly sampled functions.
Experiments presented in section 6 confirm this point of view.



5.2 Function representation 



When functions are not observed at identical evaluation points, there is still
a natural alternative approach to ours. The main idea is simply to transform
each functional observation into a representation that allows easy
manipulation. More precisely, a list of observations
$(x_j, g(x_j) + \epsilon_j)_{j \in \{1, \ldots, m\}}$ is replaced by an
approximation of $g$, $A(g)$, constructed thanks to the observations.

The only reasonable solution is to use a pseudo-linear model to approximate
the input/output mapping for each observed function. Indeed, the number of
input functions can be quite large in real world experiments and fitting a non
linear model to each function will be very time consuming. Moreover, the only
difference between $A(g)$ and $g$ is that the former is known exactly whereas
the latter is not. Representation does not solve the function manipulation
problem. If we use non linear models, calculation of a scalar product between
$A(g)$ and a weight function is still a complex problem that cannot be solved
without an approximation method for integral calculation. We are more or
less back to our original problem, except that we have now perfectly known
functions (hopefully smoothed by the representation algorithm). We do not
discuss this approach any further because it is in fact an extended version of
our method (whose theoretical properties remain to be studied).

The case of pseudo-linear models leads to what might be seen as an
alternative to our approach. Indeed, $A(g)$ is then obtained by a truncated
basis expansion, a very common approach in FDA thoroughly illustrated in
Ramsay and Silverman (1997) and more recently in Besse and Cardot (2003).



First of all, we need to assume that the studied functions belong to
$L^2(\mu)$. We choose a free system of $L^2(\mu)$, $(\phi_i)_{1 \le i \le p}$.
Then each list of observations
$(x_j, g(x_j) + \varepsilon_j)_{j \in \{1,\ldots,m\}}$ is replaced by $A(g)$,
the projection of $g$ on the vector space spanned by $\phi_1, \ldots, \phi_p$,
denoted $\mathrm{span}(\Phi_p)$. From a practical point of view, we simply
calculate the numerical parameters $\alpha_i(g)$ that minimize

$$\sum_{j=1}^{m} \left( g(x_j) + \varepsilon_j - \sum_{i=1}^{p} \alpha_i(g)\,\phi_i(x_j) \right)^2.$$

This approach has two advantages over the general non linear representation
technique. First, it is faster, as $\alpha_i(g)$ is obtained very efficiently
with some simple linear algebra. Second, it can lead to a simplified neural
model. Indeed, we can submit the numerical vector that represents a function,
$(\alpha_i(g))_{1 \le i \le p}$, to a numerical MLP (even if the observation
points depend on the function, because $p$ is the same for all functions).
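The least squares step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the basis is a hypothetical family of piecewise linear "hat" functions (the experiments of the paper use B-splines), and the helper names `hat_basis`, `solve` and `basis_coefficients` are not taken from the paper.

```python
import math


def hat_basis(p, lo, hi):
    """p piecewise-linear 'hat' functions with equally spaced nodes on [lo, hi].

    Assumption: this simple basis stands in for the B-splines of the paper.
    """
    step = (hi - lo) / (p - 1)

    def make(node):
        return lambda x: max(0.0, 1.0 - abs(x - node) / step)

    return [make(lo + i * step) for i in range(p)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def basis_coefficients(xs, ys, basis):
    """Coefficients alpha minimizing sum_j (y_j - sum_i alpha_i phi_i(x_j))^2."""
    design = [[phi(x) for phi in basis] for x in xs]
    p = len(basis)
    # Normal equations: (design^T design) alpha = design^T y.
    gram = [[sum(r[i] * r[k] for r in design) for k in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(gram, rhs)


# Two functions observed on different grids are mapped to vectors of the
# same length p, which can then be fed to an ordinary numerical MLP.
basis = hat_basis(p=6, lo=0.0, hi=1.0)
xs_a = [j / 30 for j in range(31)]          # regular grid, 31 points
xs_b = [(j / 24) ** 2 for j in range(25)]   # irregular grid, 25 points
alpha_a = basis_coefficients(xs_a, [math.sin(3 * x) for x in xs_a], basis)
alpha_b = basis_coefficients(xs_b, [x * x for x in xs_b], basis)
```

The normal equations are solved directly here for brevity; a QR or SVD based solver would be the more robust choice when the basis is large or the observation points are poorly spread.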

From a theoretical point of view, this solution is in fact a particular case
of our approach. Indeed, our approach is based on calculating an approximation
of $\int fg\,d\mu$. In $L^2(\mu)$, this is the scalar product. Let us consider
the special case where we constrain weight functions $f$ to belong to
$\mathrm{span}(\Phi_p)$, i.e., $f = \sum_{i=1}^{p} \beta_i(f)\,\phi_i$. We have
obviously

$$\int fg\,d\mu = \sum_{i=1}^{p} \beta_i(f) \int g\,\phi_i\,d\mu.$$

If we knew the real projection of $g$ on $\mathrm{span}(\Phi_p)$, $\Pi(g)$, we
would be able to replace $\int g\,\phi_i\,d\mu$ by $\int \Pi(g)\,\phi_i\,d\mu$.
This is not the case, but we can still assume that $\int g\,\phi_i\,d\mu$ is
approximately equal to $\sum_{j=1}^{p} \alpha_j(g) \int \phi_j\,\phi_i\,d\mu$.
Therefore $\int fg\,d\mu$ is approximately equal to
$\sum_{i=1}^{p} \sum_{j=1}^{p} \beta_i(f)\,\alpha_j(g) \int \phi_i\,\phi_j\,d\mu$.
Let us denote by $M$ the matrix $M_{ij} = \int \phi_i\,\phi_j\,d\mu$. As
$(\phi_i)_{1 \le i \le p}$ is a free system, $M$ is a full rank matrix. If we
denote $\gamma(f) = M\beta(f)$, we have

$$\int fg\,d\mu \simeq \sum_{j=1}^{p} \gamma_j(f)\,\alpha_j(g).$$

Moreover, given a vector of coefficients $c$, we can define a function $t$ by

$$t = \sum_{i=1}^{p} d_i\,\phi_i,$$

with $d = M^{-1}c$, such that

$$\int tg\,d\mu \simeq \sum_{j=1}^{p} \gamma_j(t)\,\alpha_j(g) = \sum_{j=1}^{p} c_j\,\alpha_j(g).$$
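The correspondence $d = M^{-1}c$ can be checked numerically. The sketch below makes several assumptions that are not in the text: $\mu$ is taken uniform on $[0,1]$, the basis is a hypothetical family of "hat" functions, there is no observation noise, and $M$ is estimated by the same empirical mean over the grid that is used for the integral. Under that last assumption the two sides agree up to rounding, whereas with an exact $M$ the agreement would only be approximate, as stated above.

```python
import math


def hat_basis(p, lo, hi):
    """p piecewise-linear 'hat' functions on [lo, hi] (illustrative basis)."""
    step = (hi - lo) / (p - 1)

    def make(node):
        return lambda x: max(0.0, 1.0 - abs(x - node) / step)

    return [make(lo + i * step) for i in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def both_sides(xs, ys, basis, c):
    """Return (sum_j c_j alpha_j(g), empirical mean of g(x_j) t(x_j))."""
    m, p = len(xs), len(basis)
    design = [[phi(x) for phi in basis] for x in xs]
    # Empirical estimate of M_ik = integral of phi_i phi_k d(mu).
    gram = [[sum(r[i] * r[k] for r in design) / m for k in range(p)]
            for i in range(p)]
    # Least squares coordinates alpha(g), from the normal equations.
    rhs = [sum(r[i] * y for r, y in zip(design, ys)) / m for i in range(p)]
    alpha = solve(gram, rhs)
    # Coefficients of the weight function t = sum_i d_i phi_i, with d = M^{-1} c.
    d = solve(gram, c)
    representation_side = sum(cj * aj for cj, aj in zip(c, alpha))
    functional_side = sum(
        y * sum(di * v for di, v in zip(d, r)) for y, r in zip(ys, design)
    ) / m
    return representation_side, functional_side


xs = [j / 49 for j in range(50)]
ys = [math.sin(3 * x) for x in xs]
rep, fun = both_sides(xs, ys, hat_basis(4, 0.0, 1.0), [1.0, -2.0, 0.5, 3.0])
```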

Therefore, a linear combination of the (approximate) coordinates of $g$ on
$\mathrm{span}(\Phi_p)$ is always approximately equal to the scalar product of
$g$ with a well chosen weight function. Our method approximates
$\int fg\,d\mu$ by another formula. In the limit case, both formulas give
identical values, and therefore our approach contains the representation based
approach as a special case. As in the previous section, this might be even
clearer with a simple one hidden layer perceptron with a unique real output.
The representation based approach maps $g$ to

$$V(g) = \sum_{i=1}^{k} a_i\,T\left(b_i + \sum_{l=1}^{p} c^i_l\,\alpha_l(g)\right), \qquad (15)$$

whereas our model gives

$$H(g) = \sum_{i=1}^{k} a_i\,T\left(b_i + \frac{1}{m} \sum_{j=1}^{m} \left(g(x_j) + \varepsilon_j\right) \sum_{l=1}^{p} d^i_l\,\phi_l(x_j)\right). \qquad (16)$$

According to the previous discussion, to obtain nearly identical values, we
just have to choose $d^i$ such that $d^i = M^{-1}c^i$ for all $i$. Of course,
from a numerical point of view, results might be slightly different (as will
be illustrated in the following section), but the truncated basis approach can
still be considered as a different implementation of a special case of our
approach. More sophisticated truncated basis approaches, involving for
instance a roughness penalty as in Besse et al. (1997), depart more from the
solution proposed here and should be studied independently.

6 Experiments



6.1 Introduction and experimental setting 



In the present section, we illustrate the proposed approach on two supervised
classification experiments. The first dataset, studied in section 6.2,
consists of the traditional waveform data introduced in Breiman et al. (1984).
In this synthetic example, the goal is to classify examples into three
classes. The second dataset, studied in section 6.3, consists of a real world
spectrometric problem in which near infrared absorbance spectra are used to
recognize high fat and low fat meat samples.



Both datasets have been used in Ferraty and Vieu (2003) to illustrate the
efficiency of the non-parametric functional kernel based model proposed in the
corresponding paper (and also in Ferraty and Vieu (2002)). We will therefore
compare results obtained with neural approaches to the functional and
classical methods used in Ferraty and Vieu (2003). Those methods include the
above mentioned kernel based model as well as the linear model, Partial Least
Squares Regression, CART, etc.



We have considered three variations of the Multi Layer Perceptron: the classi- 
cal MLP applied to raw data, the functional approach presented in this paper 
and the alternate implementation of the functional approach based on projec- 
tion on a B-spline basis (see section 5.2). 

In all our experiments, we have used a conjugate gradient training algorithm,
with 10 different random initializations. To avoid over-fitting, we used a
weight decay penalization term. To select both the architecture of the MLP and
the value of the weight decay constant, we have used k-fold cross-validation
(with k = 5). Finally, the performance of the selected MLP has been evaluated
on a test sample.
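The selection procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: `fit_and_score` is a hypothetical callback that would train a model for one candidate setting (with the 10 random restarts mentioned above) on the training indices and return its error on the validation indices.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def select_by_cross_validation(n, candidates, fit_and_score, k=5, seed=0):
    """Return the candidate with the lowest mean validation error.

    `candidates` is any iterable of settings, for instance pairs
    (number of hidden neurons, weight decay constant).
    """
    best, best_err = None, float("inf")
    for cand in candidates:
        errs = [fit_and_score(cand, train, valid)
                for train, valid in kfold_indices(n, k, seed)]
        mean_err = sum(errs) / len(errs)
        if mean_err < best_err:
            best, best_err = cand, mean_err
    return best, best_err
```

The same folds are reused for every candidate (fixed `seed`), so that candidates are compared on identical splits.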



6.2 Breiman waves 



6.2.1 Classification results 



We start our experiments with synthetic data, more precisely with the waveform
data introduced in Breiman et al. (1984). This is a three-class problem in
which each class is obtained by convex combination of shifted triangular
waveforms. The three generating waveforms are continuous curves defined on
[1,21] by:

$$h_1(t) = \max(6 - |t - 11|, 0), \qquad (17)$$
$$h_2(t) = h_1(t - 4), \qquad (18)$$
$$h_3(t) = h_1(t + 4). \qquad (19)$$

Functions to classify have the following general forms:

$$x(t) = u\,h_1(t) + (1 - u)\,h_2(t) \quad \text{for class 1}, \qquad (20)$$
$$x(t) = u\,h_1(t) + (1 - u)\,h_3(t) \quad \text{for class 2}, \qquad (21)$$
$$x(t) = u\,h_2(t) + (1 - u)\,h_3(t) \quad \text{for class 3}, \qquad (22)$$

where $u \in\, ]0,1[$. In Breiman et al. (1984) each function is transformed
into a vector of $\mathbb{R}^{21}$ by a uniform sampling on [1,21]. An
independent standard Gaussian noise is added to each observation.



In order to stay closer to the functional framework, we follow Ferraty and
Vieu (2003) and therefore work with vectors of $\mathbb{R}^{101}$ which
correspond to a uniform sampling of each function on [1,21]. The training
sample is obtained exactly as in Ferraty and Vieu (2003): we have 150
functions in each class (in order to build such a function, the parameter $u$
is chosen uniformly in ]0,1[, independently for each function) and an
independent standard Gaussian noise is added to each observation. The test
sample is generated with the same method but contains 250 functions in each
class.
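The data generation described above is simple enough to sketch directly. The function and variable names below are illustrative, and the random number generator and seeds are assumptions (the paper does not specify them).

```python
import random


def h1(t):
    return max(6.0 - abs(t - 11.0), 0.0)


def h2(t):
    return h1(t - 4.0)


def h3(t):
    return h1(t + 4.0)


# Pair of generating waveforms combined in each class, as in (20)-(22).
CLASS_PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}


def sample_wave(label, rng, n_points=101, noise_sd=1.0):
    """One noisy wave of the given class, sampled uniformly on [1, 21]."""
    first, second = CLASS_PAIRS[label]
    u = rng.random()  # uniform on [0, 1), standing in for the open interval ]0, 1[
    grid = [1.0 + 20.0 * j / (n_points - 1) for j in range(n_points)]
    return [u * first(t) + (1.0 - u) * second(t) + rng.gauss(0.0, noise_sd)
            for t in grid]


def make_sample(n_per_class, seed):
    """List of (wave, label) pairs with n_per_class waves in each class."""
    rng = random.Random(seed)
    return [(sample_wave(label, rng), label)
            for label in (1, 2, 3) for _ in range(n_per_class)]


train = make_sample(150, seed=0)   # 450 training waves
test = make_sample(250, seed=1)    # 750 test waves
```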



As explained in the introduction, we have compared three neural approaches: a
naive approach in which $\mathbb{R}^{101}$ vectors are directly submitted to a
classical one hidden layer perceptron, our functional approach in which
functional weights are represented by B-splines, and the alternate
implementation of our method based on projection on the same B-spline basis.
We refer to Ferraty and Vieu (2003) for comparison with classical methods and
with the non parametric functional method introduced in that paper. Table 1
gives the results obtained for the three neural methods (MLP corresponds to
the naive approach, FMLP to our functional approach and FpMLP to the alternate
implementation of this approach). Results have been averaged over 50
simulations, exactly as in Ferraty and Vieu (2003), so as to ease comparison
with existing results.



Method    Test classification error rate    Standard deviation
MLP       0.098                             0.013
FMLP      0.065                             0.0096
FpMLP     0.072                             0.011

Table 1: Waveform data



Results are very satisfactory. First of all, our functional approaches
outperform the classical MLP method (the main functional approach gives the
best results). The summary provided by table 1 does not give complete
information. Indeed, as the same data set is used by every method in each
simulation, a direct comparison between the obtained results is possible. An
important result is that for all simulations, functional approaches outperform
the classical MLP. The mean performance increase is 3.2 percentage points for
our main implementation and 2.6 percentage points for the alternate projection
based implementation. Moreover, the main implementation outperforms the
projection based one on 38 simulations (and the mean performance increase is
0.6 percentage points).
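A paired comparison of this kind only needs the per-simulation error rates of the two methods. The helper below is a sketch; its name and the numbers in the usage lines are purely illustrative (the per-simulation values are not given in the paper).

```python
def paired_comparison(errors_a, errors_b):
    """Compare two methods evaluated on the same simulations.

    Lower error is better. Returns the number of simulations won by each
    method, the number of ties, and the mean gain of A over B.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both methods must be evaluated on the same simulations")
    n = len(errors_a)
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    ties = sum(a == b for a, b in zip(errors_a, errors_b))
    mean_gain = sum(b - a for a, b in zip(errors_a, errors_b)) / n
    return {"wins_a": wins_a, "wins_b": n - wins_a - ties,
            "ties": ties, "mean_gain": mean_gain}


# Hypothetical error rates on three simulations, for illustration only.
summary = paired_comparison([0.06, 0.07, 0.08], [0.09, 0.07, 0.07])
```

Because each simulation uses the same data for every method, counting wins and ties in this way is more informative than comparing the two means alone.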



According to results reported in Ferraty and Vieu (2003), the functional MLP
approach outperforms both classical methods (such as CART) and functional
ones. The best method studied in Ferraty and Vieu (2003) achieves a mean
classification error rate of 0.072 (with a standard deviation of 0.012). We
can therefore conclude that our functional MLP is among the best methods for
this dataset and that it outperforms both traditional methods and a classical
neural approach. Moreover, as explained in the following section, the obtained
functional model is very parsimonious, which gives it robustness and
efficiency.



6.2.2 Parameter numbers 

For all methods, we select the best number of hidden neurons among 2, 3 
and 4 hidden neurons. For the functional approaches, weight functions were 
represented using 5, 7, 10, 15 or 20 B-splines (those numbers have been cho- 
sen to keep the architectures as simple as possible). The chosen architecture 
depends on the simulation, but in general, small architectures are preferred, 
as summarized by the following tables. Table 2 gives the number of times each
B-spline basis has been chosen and table 3 gives the number of times each
number of hidden neurons has been chosen.



Number of B-splines    5     7     10    15    20
FMLP                   18    17    8     3     4
FpMLP                  38    8     4     0     0

Table 2: Number of simulations that select the given number of B-splines



Number of hidden neurons    2     3     4
MLP                         0     10    40
FMLP                        9     29    12
FpMLP                       10    28    12

Table 3: Number of simulations that select the given number of hidden neurons

For our main functional approach, the total number of numerical parameters
used varies between 23 and 103, with a mean of 44 (the median is 39 and only
10 simulations needed more than 51 parameters). For the projection based
implementation, the total number of numerical parameters used varies between
23 and 63, with a mean of 36 (the median is 33 and only one simulation out of
50 uses more than 51 parameters). The projection based approach therefore uses
even fewer parameters than our main functional approach, but with a slight
decrease in performance.

For the naive approach, cross-validation selects 3 hidden neurons for 10 sim- 
ulations and 4 hidden neurons for the other 40 simulations. Those values 
correspond respectively to 321 and 427 numerical parameters (the mean is 
406). The naive approach uses therefore far more parameters than functional 
methods and gives worse results. 
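As a quick check of the mean quoted above, the 50 simulations split into 10 models with 321 parameters and 40 models with 427 parameters:

```latex
\frac{10 \times 321 + 40 \times 427}{50} = \frac{3210 + 17080}{50} = 405.8 \approx 406 .
```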



The best method studied in Ferraty and Vieu (2003) is a non-parametric
functional method in which functions are first projected on an optimal basis
constructed by multivariate partial least squares regression. Optimal results
are obtained with a projection on three basis functions (this number is
selected by k-fold cross-validation). As the method is kernel based, we have
to store all the functions of the training sample. That is, we need to keep a
vector of $\mathbb{R}^{101}$ for each basis function (303 numerical
parameters) as well as the coordinates of each training function on this basis
(3 parameters for each function). We have therefore a total of 1653 numerical
parameters.
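The total can be reconstructed from the figures given in the text, using the 3 x 150 = 450 training functions of section 6.2.1:

```latex
\underbrace{3 \times 101}_{\text{basis functions}} + \underbrace{3 \times 450}_{\text{training coordinates}}
  = 303 + 1350 = 1653 .
```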



6.2.3 Pre-smoothing

A possible explanation for the poor performances of the standard MLP is that 
Breiman waves are very noisy. One side effect of using function representation, 
either for the functional weights or for the data themselves, is to smooth the 
waves. It is therefore quite natural to investigate the effect of applying a spline 
smoothing method on the waves before submitting them to a standard MLP. 

In order to implement a fair comparison, we have used the following method: we
calculate the coordinates of training and test waves on each B-spline basis
considered in the previous series of experiments. These coordinates are used
to reconstruct smooth versions of the waves that are sampled exactly as the
original waves (101 points regularly spaced in [1,21]). The obtained
$\mathbb{R}^{101}$ vectors are then submitted to a classical one hidden layer
perceptron. The number of B-splines used for the smoothing phase, the number
of hidden neurons and the weight decay are then selected by k-fold
cross-validation (with k = 5).



The test set performances are much better than with the basic MLP approach.
Indeed, the mean error rate is now 0.073 (with a standard deviation of 0.012),
which is comparable to the non-parametric approach of Ferraty and Vieu (2003)
and to the projection based implementation of the functional MLP.
Nevertheless, a direct comparison shows that our main implementation in fact
performs better than the smoothing approach for 42 simulations out of 50. The
projection based implementation obtains better results for 27 simulations. The
basic MLP approach obtains better results than the smoothing approach for 1
simulation out of 50.

It seems therefore that smoothing plays an important role in obtaining good 
performances, but also that it does not help in reducing the number of pa- 
rameters. Indeed, the mean number of parameters used by the smoothing 
approach is 406. With only 44 parameters, our main implementation obtains 
slightly better results. 



6.2.4 Comments 



Table 4 summarizes the results obtained on the Breiman waves. It is clear that
the functional MLP approach gives very satisfactory results on those data. The
obtained classification error rate is slightly better than the best results
reported in Ferraty and Vieu (2003), which means that the MLP approach
performs better than both traditional approaches and functional approaches.
Moreover, the functional MLP approach also outperforms a naive MLP modeling of
the raw multivariate data, as well as a more complex method in which a spline
smoothing is performed on the raw data before submitting them to a classical
MLP. Finally, the obtained model is very parsimonious: the MLP classifier will
be faster than the kernel based one (after training).



Model                 Parameters    Error rate    Training time
FMLP                  44            0.065         4.5
FpMLP                 36            0.072         3.8
Non parametric        1653          0.072         ≈ 0
MLP with smoothing    406           0.073         5.9
MLP                   406           0.098         1

Table 4: Results summary

Table 4 also shows the relative cost of the studied methods in terms of
training time: the total training time of the classical MLP applied to raw
data has been chosen as the reference training time (the values include the
cross-validation phase). As the non parametric approach of Ferraty and Vieu
(2003) involves almost no training phase (except for the selection of the
kernel width), it has been considered as almost instantaneous compared to MLP
training. Most of the cost comes from the model selection phase. Indeed, for
the basic MLP, we just have to select the weight decay parameter and the
number of hidden neurons. On the contrary, all other methods involve the
selection of the representation basis (here the number of B-splines). An
interesting point is that the functional approaches are faster to train than
the smoothing approach, give better results and produce a very parsimonious
summary of the data.

Compared to a classical MLP, the functional approach requires around 4.5 times
more processing power in the training phase. Fortunately, training is done
only once and produces a very small footprint solution that can be implemented
on a small device such as a cell phone or a PDA, with recognition performances
that are significantly better than those of the classical MLP.



6.3 Spectrometric data 



6.3.1 Raw data 



Our next example is a real world classification problem on spectrometric data
from the food industry. Each observation is the near infrared absorbance
spectrum of a (finely chopped) meat sample, recorded on a Tecator Infratec
Food and Feed Analyser. More precisely, an observation consists of a 100
channel spectrum of absorbances in the wavelength range 850-1050 nm. The goal
is to classify meat samples into high fat samples and low fat samples. The
first class consists of meat samples with less than 20% fat, whereas the
second class contains all other meat samples. We have a total of 215 spectra.
Data are not organized into a training sample and a test sample; we therefore
follow exactly the evaluation method described in Ferraty and Vieu (2003): we
randomly select 160 training spectra and 55 test spectra. We repeat this
operation 50 times and give the average classification error rate.
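The evaluation protocol can be sketched as repeated random splitting. The names below are illustrative and `evaluate` is a hypothetical callback that would train on the first index list and return the error rate on the second; the seed is an assumption.

```python
import random


def repeated_splits(n_total=215, n_train=160, n_repeats=50, seed=0):
    """Yield (train, test) index lists for repeated random splitting."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_total))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def mean_error_rate(evaluate, **split_options):
    """Average the test error returned by `evaluate(train, test)` over all splits."""
    rates = [evaluate(train, test)
             for train, test in repeated_splits(**split_options)]
    return sum(rates) / len(rates)
```

With 55 test spectra, each misclassified spectrum changes the error rate of a split by 1/55, that is about 0.018, which is the granularity of the per-split error rates.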



We have compared the three approaches described in the introduction. The
preprocessing studied in section 6.2.3 was not considered here because
absorbance spectra are very smooth and a B-spline basis projection has no
noticeable smoothing effect on those functions. Table 5 gives statistical
summaries of the classification error rate obtained by those neural methods.

Method    First quartile    Mean     Median    Third quartile
MLP       0                 0.019    0.018     0.036
FMLP      0.018             0.028    0.036     0.036
FpMLP     0                 0.018    0.018     0.036

Table 5: Error rate for Spectrometric curves

In this situation, only the alternate implementation of the functional
approach gives satisfactory results. Indeed, the naive MLP approach gives
better results than our main functional implementation. As in the previous
section, generated data sets are identical for each method and a direct
comparison between obtained results is possible. The naive method performs
better than the FMLP method on 21 data sets (identical performances are
obtained on 27 simulations).



But the FpMLP method still performs better than the naive approach. The
average performance improvement is only 0.001, but FpMLP performs better than
MLP on 30 simulations (identical performances are obtained on 13 simulations).
We can therefore conclude that the best functional approach gives slightly
better performances than the MLP approach. Moreover, the best method reported
in Ferraty and Vieu (2003) obtains a median classification error rate of
approximately 0.022, which shows again that neural methods perform very well.
Additionally, the best method reported in Ferraty and Vieu (2003) is, as in
the previous section, a mixed method that uses a functional non parametric
model on functions projected on an optimal basis generated by non functional
multivariate partial least squares regression. Ferraty and Vieu (2003) report
that a pure functional approach (in which functional principal component
analysis is used to design an optimal projection) gives very bad results (the
mean error rate is 0.2). On the contrary, our methods are pure functional
methods and still give the best results.



Moreover, functional methods use a small number of numerical parameters. 
For all methods, we select the best number of hidden neurons among 2, 3 and 
4 hidden neurons. For the functional approaches, weight functions were rep- 
resented using 15 or 20 B-splines. In general, methods choose a small number 
of neurons, as shown in tables 6 and 7. 



Number of B-splines    15    20
FMLP                   12    38
FpMLP                  24    26

Table 6: Number of simulations that select the given number of B-splines



Number of hidden neurons    2     3     4
MLP                         24    16    10
FMLP                        32    11    7
FpMLP                       18    17    15

Table 7: Number of simulations that select the given number of hidden neurons



The classical MLP approach uses between 213 and 423 parameters, with a mean of
288 parameters. The main functional approach uses between 43 and 103
parameters (the mean is 62), whereas the projection based approach has the
same range of parameter numbers with a higher mean (69). The best method
reported in Ferraty and Vieu (2003) uses 1300 parameters (almost 19 times more
than our best method) with slightly worse performances.
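The ratio quoted above follows from the reported figures, taking the mean of 69 parameters of the projection based approach as the reference:

```latex
\frac{1300}{69} \approx 18.8 .
```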



6.3.2 Second order derivatives 



Ferraty and Vieu (2002) and Ferraty and Vieu (2003) point out that the second
derivative of the spectrum is in general more informative than the spectrum
itself. The non parametric approach proposed in Ferraty and Vieu (2003) has
been used with a second derivative based semi-metric and achieved better
results than the optimal projection based method. Indeed, the median error
rate of a pure functional approach is now slightly less than 0.022. This
method turns out to be the best overall method.



We have therefore applied our functional MLP approaches to the second
derivative of the spectrum. As in Ferraty and Vieu (2003), we estimate the
spectrum with a B-spline representation. The second derivative of the B-spline
is calculated exactly and sampled uniformly on [850, 1050], as the original
data. We therefore obtain new functional data that we model as normal
functional data (that is, we forget the preprocessing phase).

Table 8 gives statistical summaries of the classification error rate obtained by 
the neural methods applied to the second order derivatives. 



Method    First quartile    Mean     Median    Third quartile
MLP       0                 0.013    0.018     0.018
FMLP      0                 0.007    0         0.018
FpMLP     0                 0.014    0.009     0.018

Table 8: Error rate for second order derivatives of the Spectrometric curves

We obtain very satisfactory results as all neural methods perform better than
the results reported in Ferraty and Vieu (2003). Moreover, the best results
are obtained by our main functional MLP implementation. A direct comparison
between results obtained for each simulation shows that FMLP outperforms MLP
on 15 simulations (identical performances are obtained on 31 simulations). The
FMLP also outperforms FpMLP on 19 simulations (identical performances are
obtained on 20 simulations). In fact FMLP provides perfect classification of
the test set for 34 simulations, whereas this number drops to 25 for FpMLP and
to 24 for MLP.



We do not report architecture selection results in full as they are very
similar to those obtained on the raw functional data. MLP uses a mean number
of 391 parameters, FMLP 82 and FpMLP 76.



6.3.3 Comments

As in the Breiman wave experiments, an appropriate functional MLP model
achieves a very good recognition rate that cannot be reached by a classical
MLP. Moreover, the optimal functional MLP uses a small number of parameters,
which eases its real world implementation. We have not reported training times
here as they are comparable to the values reported in table 4: the price to
pay for a higher recognition rate and a lower number of parameters is a higher
training time than the one needed for a classical MLP, mainly because of the
additional parameter (the number of B-splines) that has to be chosen by
cross-validation.



6.4 Conclusions 



In both experiments (on simulated data and on real world data), functional
multi-layer perceptrons perform in a very satisfactory way. They are at least
as good as the functional and traditional methods presented in Ferraty and
Vieu (2003). Moreover, they also outperform a naive MLP modeling of the raw
multivariate data. A way to obtain correct results with a classical MLP is to
perform a kind of functional preprocessing: a spline smoothing for noisy data
such as the Breiman waves or a derivative calculation for smooth data such as
the absorbance spectra. But even those mixed approaches do not perform as well
as the functional MLP. Another important practical property is the small
number of numerical parameters used by the functional neural methods: this
allows an easier implementation on devices with limited resources such as
PDAs, cell phones and more generally embedded devices.

Of course, additional experiments on real world data are needed to fully
understand the advantages and shortcomings of the proposed functional MLP.
While the model has been compared to traditional classification methods thanks
to the experiments conducted in Ferraty and Vieu (2003), additional
comparisons, especially to recent methods such as support vector machines (see
e.g. Cristianini and Shawe-Taylor (2000)) or boosted classification trees (see
e.g. Hastie et al. (2001)), are also needed.



An interesting open research topic is to develop automatic tuning of the
weight function representation. We have used here a brute force k-fold
cross-validation method, but Ferraty and Vieu (2003) show that automatic
design of the projection basis can improve performances. Moreover, this might
reduce the training time of the functional MLP, which remains the only
negative part of the proposed approach compared to the classical MLP (when the
latter is used without functional preprocessing).



7 Conclusion 



In this paper, we have introduced Functional Multi-Layer Perceptrons
(FMLP), a simple extension of MLP to functional data. The proposed model
is very interesting from a theoretical point of view because it shares
useful properties with its numerical counterpart.

We have indeed shown that FMLP are universal approximators, that is they 
can approximate continuous mappings from a compact subset of a functional 
space to R with arbitrary precision. For a given function to approximate to 
a given accuracy, the approximating FMLP uses a finite number of numerical 
parameters. 

Moreover, we have shown that parameter estimation for FMLP is consistent: 
optimal parameters estimated thanks to a finite number of functions known 
at a finite number of measurement points converge to the set of true optimal 
parameters when the size of the data increases. 

We have also shown on simulated and real world data that the FMLP performs in
a very satisfactory way. Performances are in general better than those
obtained by non functional methods (including neural methods) and at least as
good as other functional methods. Moreover, the functional approach gives a
much more parsimonious representation of the studied data, a property that
enhances the robustness of the obtained models and also allows an easier
implementation on devices with limited processing power. We therefore believe
that Functional Multi-Layer Perceptrons are a valuable tool for data analysis
when a functional representation of input variables is possible.

8 Proofs



Proof of corollary 6 If $1 < p < \infty$, we know that $L^q(\mu)$ (with
$q < \infty$) can be identified with $(L^p(\mu))^*$ (see for instance Rudin
(1974)). More precisely, for each $l \in (L^p(\mu))^*$ there is a unique
function $f \in L^q(\mu)$ such that $l(g) = \int fg\,d\mu$. By hypothesis, $V$
is dense in $L^q(\mu)$. This obviously implies that $A_V$ is dense in
$(L^p(\mu))^*$ for the weak * topology. We can therefore apply corollary 5.1.3
of Stinchcombe (1999) (note that corollary 5.1.3 is given for the outside
density case, but the author states explicitly that a similar inside corollary
is valid).



If $p = \infty$, we cannot directly apply corollary 5.1.3 from Stinchcombe
(1999) as the dual of $L^\infty(\mu)$ is not $L^1(\mu)$. Let us nevertheless
consider $A$, the set of affine functions on $L^\infty(\mu)$ defined by
$l(f) = a + \int fg\,d\mu$, where $a$ is an arbitrary real number and $g$ is
an arbitrary function from $V \subset L^1(\mu)$. $A$ is obviously a vector
space which contains the constant functions of $C(K, \mathbb{R})$. Let us now
show that $A$ separates points in $K$. Let $u$ and $v$ be two distinct
functions of $K$. The function $f = u - v$ is a non zero function belonging to
$L^\infty(\mu)$. We can assume that the measurable set
$H = \{x \in \mathbb{R}^n \mid f(x) > 0\}$ has non zero finite measure (if it
is not the case, replace $f$ by $-f$). Then, obviously
$\int f\chi_H\,d\mu > 0$, that is
$\int u\chi_H\,d\mu \neq \int v\chi_H\,d\mu$. As $\mu$ is finite, $\chi_H$
belongs to $L^1(\mu)$. As $V$ is dense in $L^1(\mu)$, there is a sequence
$h_k$ of functions in $V$ that converges to $\chi_H$. We have obviously

$$\left| \int f\,(h_k - \chi_H)\,d\mu \right| \le \|f\|_\infty \int |h_k - \chi_H|\,d\mu.$$

Therefore, there is an index $k$ such that $\int fh_k\,d\mu > 0$, that is
there is a function $h_k \in V$ such that
$\int uh_k\,d\mu \neq \int vh_k\,d\mu$. Therefore, $A$ separates points in
$K$. The conclusion is then obtained by applying theorem 5.1 of Stinchcombe
(1999).



Proof of corollary 7 As $\mu$ is a finite Borel measure on $\mathbb{R}^n$, it
is regular (Rudin (1974), theorem 2.18), and we can apply Lusin's theorem
(Rudin (1974), theorem and corollary 2.23). We know therefore that for any
function $f$ in $L^\infty(\mu)$, there is a sequence of compactly supported
continuous functions $g_k$ that converges pointwise to $f$ and such that
$\|g_k\|_\infty \le \|f\|_\infty$. A simple application of Lebesgue's
dominated convergence theorem shows that for any function $h$ in $L^1(\mu)$,
$\int g_kh\,d\mu \to \int fh\,d\mu$ as $k \to \infty$. Then, as $\mu$ is
compactly supported, there is a compact $K$ such that
$\int g_kh\,d\mu = \int_K g_kh\,d\mu$. Then, thanks to the hypothesis, each
$g_k$ can be approximated by a function $\phi_k$ in $V$ such that
$\sup_{x \in K} |g_k(x) - \phi_k(x)| < \frac{1}{k}$. In this case
$\left| \int_K g_kh\,d\mu - \int_K \phi_kh\,d\mu \right| < \frac{1}{k}\|h\|_1$.
As $\mu$ is compactly supported, this allows us to conclude that
$\int \phi_kh\,d\mu \to \int fh\,d\mu$.

Therefore, the set of linear forms $A_V$ is dense for the weak * topology in
$(L^1(\mu))^*$, provided that $\mu$ is finite and compactly supported. The
conclusion is then obtained by applying corollary 5.1.3 from Stinchcombe
(1999).



Proof of theorem 8 The proof is quite technical and can be cut into several
parts:

(1) We first need a quite general Uniform Strong Law of Large Numbers
(USLLN), which will be obtained thanks to a general result of Andrews
(1987).

(2) Then we show that the integral approximations used in the definition of
$\lambda_n^m(w)$ have a kind of uniform convergence property.

(3) Using both results, we show that $\lambda_n^m(w)$ converges almost surely
uniformly to $\lambda(w)$.

(4) The conclusion is obtained thanks to a simple lemma on approximation
of the minimizers of a function.



part 1 

A very general Uniform Strong Law of Large Numbers (USLLN) is given in
Andrews (1987). It is based on complex assumptions, so we propose to simplify
it into the following corollary:

Corollary 9 Let $X$ be an arbitrary metric space considered with its Borel 
sigma algebra. Let $(\Omega, \mathcal{A}, P)$ be a probability space on which is defined a 
sequence of independent identically distributed random elements $Z_i$ with values 
in $X$. Let $W$ be a compact metric space. Let $l$ be a function from $W \times X$ to 
$\mathbb{R}$. We assume that the following conditions hold: 

(1) For each $w \in W$, $l(w, .)$ is a measurable function from $X$ to $\mathbb{R}$. 

(2) For each $x \in X$, $l(., x)$ is a continuous function from $W$ to $\mathbb{R}$. 

(3) There is a positive measurable function $d$ (from $X$ to $\mathbb{R}$) such that for all 
$x \in X$ and for all $w \in W$, $|l(w, x)| \le d(x)$. 

(4) $E(d(Z_i)) < \infty$. 



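Then we have: 

$$\sup_{w \in W} \left| \frac{1}{n} \sum_{i=1}^{n} l(w, Z_i) - E(l(w, Z_1)) \right| \xrightarrow[n \to \infty]{} 0 \quad P \text{ a.s.}$$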
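As an illustration of what corollary 9 asserts, the following minimal sketch simulates the uniform deviation in a toy case. Everything in it is an assumption made for the example and is not taken from the paper: the loss $l(w, z) = (z - w)^2$, the parameter set $W = [-1, 1]$ (replaced by a finite grid), and standard normal $Z_i$. In this toy case the four conditions hold with $d(z) = (|z| + 1)^2$.

```python
import numpy as np

def sup_deviation(n, rng, w_grid):
    """Largest gap over w_grid between the empirical mean of l(w, Z_i) and E l(w, Z)."""
    z = rng.standard_normal(n)                      # Z_1, ..., Z_n i.i.d. N(0, 1)
    # l(w, z) = (z - w)^2, so the empirical mean expands into sample moments of Z
    emp = np.mean(z**2) - 2.0 * w_grid * np.mean(z) + w_grid**2
    exact = 1.0 + w_grid**2                         # E (Z - w)^2 for Z ~ N(0, 1)
    return float(np.max(np.abs(emp - exact)))

rng = np.random.default_rng(0)
w_grid = np.linspace(-1.0, 1.0, 201)                # finite grid standing in for W = [-1, 1]
deviations = {n: sup_deviation(n, rng, w_grid) for n in (100, 10_000, 1_000_000)}
for n, dev in deviations.items():
    print(n, dev)
```

The printed supremum should shrink as $n$ grows, which is the behaviour the corollary guarantees almost surely; a single simulated path is of course only suggestive.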



In order to prove this corollary, we first need a simple lemma: 

Lemma 10 Let $l$ be a function from $W \times X$ to $\mathbb{R}$, where $W$ is a separable 
metric space and $X$ is a metric space (considered with its Borel sigma algebra). 
If $l$ is continuous on $W$ for each fixed $x \in X$ and measurable on $X$ for each 
fixed $w \in W$, then the function $f(x) = \sup_{w \in W} l(w, x)$ is measurable. 






Proof of lemma 10 As $W$ is separable, there is a countable set $W' = 
\{w_i \mid i \in \mathbb{N}^*\}$ dense in $W$. Let us show that $f(x) = \sup_{w \in W'} l(w, x)$. Let 
us consider a fixed $x \in X$. Let $\epsilon$ be an arbitrary positive real number. By 
definition of $f$, there is $w \in W$ such that $l(w, x) > f(x) - \frac{\epsilon}{2}$. As $l(., x)$ is 
continuous in $w$, there is $\eta$ such that $|w' - w| < \eta$ implies $|l(w', x) - l(w, x)| < \frac{\epsilon}{2}$, 
which implies $l(w', x) > f(x) - \epsilon$. As $W'$ is dense in $W$, there is $w' \in W'$ 
such that $|w' - w| < \eta$. This implies $f(x) \ge \sup_{w \in W'} l(w, x) > f(x) - \epsilon$. As 
this is true for each $\epsilon$, we have $f(x) = \sup_{w \in W'} l(w, x)$. Therefore, 
$f(x) = \sup_{i \in \mathbb{N}^*} l(w_i, x)$. As each function $l(w_i, .)$ is measurable, the sup is also 
measurable. 
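The key step of this proof, that a supremum over $W$ can be computed on a countable dense subset when $l(., x)$ is continuous, can be checked numerically. The sketch below is only an illustration under assumed choices that do not come from the paper: $W = [0, 1]$, the dyadic points $j / 2^p$ as dense subset, and a hypothetical function `l`.

```python
import numpy as np

def l(w, x):
    # Hypothetical function, continuous in w for each fixed x
    return np.sin(5.0 * w) * x - (w - 0.3) ** 2

def sup_over_dyadic(x, level):
    """Sup of l(., x) over the dyadic points j / 2**level of W = [0, 1]."""
    w = np.arange(2**level + 1) / 2**level
    return float(np.max(l(w, x)))

x = 0.7
# The dyadic grids are nested, so these suprema can only increase with the level
sups = [sup_over_dyadic(x, level) for level in range(1, 15)]
for level, value in enumerate(sups, start=1):
    print(level, value)
```

Because the grids are nested and their union is dense in $[0, 1]$, the sequence is nondecreasing and stabilises at the supremum over the whole interval.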



We can now proceed to the proof of the corollary: 



Proof of corollary 9 We obtain corollary 9 as a consequence of Andrews' 
theorem (Andrews (1987)). We have to check three assumptions: 

(1) Assumption A1 is fulfilled as $W$ is compact ($W$ corresponds to $\Theta$ in 
Andrews' paper). 

(2) Assumption A2 breaks into two sub-assumptions: 

(a) Assumption A2(a) can be translated with our notation into 
the following assumption: for all $w_0$ (and all $i$), $l(w_0, Z_i)$, 
$\sup_{w \in W(w_0, \eta)} l(w, Z_i)$ and $\inf_{w \in W(w_0, \eta)} l(w, Z_i)$ are random 
variables (where $W(w_0, \eta) = B(w_0, \eta) \cap W$, and $B(w_0, \eta)$ is the closed 
ball centered on $w_0$ with radius $\eta$). 

$l(w_0, Z_i)$ is a random variable thanks to assumption 1 of corollary 
9. Thanks to assumptions 1 and 2 of corollary 9 and due to the fact 
that a compact set is separable, lemma 10 can be applied to $l$ and 
to $W(w_0, \eta)$, which allows us to conclude that $\sup_{w \in W(w_0, \eta)} l(w, Z_i)$ is a 
random variable. The case of $\inf_{w \in W(w_0, \eta)} l(w, Z_i)$ is handled thanks 
to the same lemma applied to $-l$. 

Assumption A2(a) is therefore fulfilled. 

(b) Assumption A2(b) translates in our case into the assumption that 
$\sup_{w \in W(w_0, \eta)} l(w, Z_i)$ and $\inf_{w \in W(w_0, \eta)} l(w, Z_i)$ satisfy a point-wise 
strong law of large numbers, that is, for any fixed $w_0$: 

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sup_{w \in W(w_0, \eta)} l(w, Z_i) = E\left( \sup_{w \in W(w_0, \eta)} l(w, Z_i) \right) \quad P \text{ a.s.}$$

As shown in the previous point, both $\left( \sup_{w \in W(w_0, \eta)} l(w, Z_i) \right)_{i \in \mathbb{N}^*}$ 
and $\left( \inf_{w \in W(w_0, \eta)} l(w, Z_i) \right)_{i \in \mathbb{N}^*}$ are sequences of independent 
identically distributed random variables. Moreover, thanks to assumptions 
3 and 4 of corollary 9, they are integrable and therefore the strong 
law of large numbers applies: assumption A2(b) is therefore fulfilled. 
(3) Assumption A3 translates in our case into the following assumption: 

$$\lim_{\eta \to 0} \sup_{n \ge 1} \left| \frac{1}{n} \sum_{i=1}^{n} \left( E\left( \sup_{w \in W(w_0, \eta)} l(w, Z_i) \right) - E(l(w_0, Z_i)) \right) \right| = 0.$$

A similar equation has to be fulfilled by $E\left( \inf_{w \in W(w_0, \eta)} l(w, Z_i) \right)$. 

As $l$ is continuous with respect to $w$ for a fixed $x$, we have the following 
point-wise convergence: 

$$\lim_{\eta \to 0} \sup_{w \in W(w_0, \eta)} l(w, .) = l(w_0, .).$$

Thanks to assumptions 3 and 4 of corollary 9, we can apply Lebesgue 
dominated convergence, which implies: 

$$\lim_{\eta \to 0} E\left( \sup_{w \in W(w_0, \eta)} l(w, Z_i) \right) = E(l(w_0, Z_i)).$$

Finally, as the $Z_i$ are identically distributed, assumption A3 can be simplified 
into: 

$$\lim_{\eta \to 0} \left| E\left( \sup_{w \in W(w_0, \eta)} l(w, Z_1) \right) - E(l(w_0, Z_1)) \right| = 0,$$

which is exactly what we have just proven. The case of 
$E\left( \inf_{w \in W(w_0, \eta)} l(w, Z_i) \right)$ can be obtained in exactly the same way. 

Assumption A3 is therefore fulfilled. 

As the assumptions are fulfilled, we can apply Andrews' theorem which gives 
exactly the conclusion of corollary 9. 
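The dominated convergence step used for assumption A3 can be visualised on a toy case. The following sketch is an illustration only, under assumptions that are not in the paper: $l(w, z) = (z - w)^2$, $w_0 = 0.25$, standard normal $Z$, and expectations replaced by averages over one fixed sample. It computes the sample mean of the local supremum over the closed ball of radius $\eta$ and compares it with the sample mean of $l(w_0, Z)$.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(50_000)        # fixed sample of Z ~ N(0, 1)
w0 = 0.25

def mean_local_sup(eta, n_grid=11):
    """Sample mean of the sup over |w - w0| <= eta of l(w, z) = (z - w)^2."""
    # Grid on the closed ball; the endpoints are included, and for this convex
    # l the sup over the ball is attained at one of them
    w = np.linspace(w0 - eta, w0 + eta, n_grid)
    values = (z[:, None] - w[None, :]) ** 2            # shape (n_samples, n_grid)
    return float(np.mean(values.max(axis=1)))

baseline = float(np.mean((z - w0) ** 2))               # sample mean of l(w0, Z)
etas = [0.5, 0.1, 0.02, 0.004]
gaps = [mean_local_sup(eta) - baseline for eta in etas]
for eta, gap in zip(etas, gaps):
    print(eta, gap)
```

The gap is positive and decreases towards zero with $\eta$, which is the simplified form of assumption A3 in this example.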



part 2 

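Let us define: 

$$M_l^i(g, w_l)(\omega)_m = \frac{1}{m} \sum_{j=1}^{m} F_l\left(w_l, X_j^i(\omega)\right) \left( g(X_j^i(\omega)) + \varepsilon_j^i(\omega) \right),$$

which can be simplified into $M_l^i(g, w_l)_m$ when $\omega$ is obvious, and 

$$M_l(g, w_l) = \int F_l(w_l, x) \, g(x) \, d\mu(x).$$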
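To make the two quantities concrete, here is a minimal sketch that compares the noisy empirical average with the integral it approximates, uniformly over a grid of weights. All concrete choices are assumptions for illustration and not the paper's setting: $\mu$ uniform on $[0, 1]$, a hypothetical weight function `F(w, x) = cos(w x)`, input function `g(x) = x**2`, centred Gaussian noise, and a midpoint rule as reference value for the integral.

```python
import numpy as np

def F(w, x):
    # Hypothetical weight function, continuous in w and bounded by d(x) = 1
    return np.cos(w * x)

def g(x):
    # Hypothetical continuous input function on [0, 1]
    return x**2

def noisy_average(w_grid, m, rng, noise_std=0.1):
    """(1/m) sum_j F(w, X_j) (g(X_j) + eps_j) for every w in w_grid."""
    x = rng.uniform(0.0, 1.0, size=m)               # X_j i.i.d. with law mu = U[0, 1]
    eps = rng.normal(0.0, noise_std, size=m)        # centred noise, independent of X_j
    y = g(x) + eps                                  # noisy observations of g
    return np.array([np.mean(F(w, x) * y) for w in w_grid])

def reference_integral(w_grid, n_nodes=20_000):
    """Midpoint-rule approximation of the integral of F(w, x) g(x) over [0, 1]."""
    x = (np.arange(n_nodes) + 0.5) / n_nodes
    return np.array([np.mean(F(w, x) * g(x)) for w in w_grid])

rng = np.random.default_rng(2)
w_grid = np.linspace(-3.0, 3.0, 61)                 # grid standing in for a compact weight set
target = reference_integral(w_grid)
errors = {m: float(np.max(np.abs(noisy_average(w_grid, m, rng) - target)))
          for m in (100, 10_000, 400_000)}
for m, err in errors.items():
    print(m, err)
```

The largest error over the weight grid should shrink as the number $m$ of observation points grows, which is the kind of uniform convergence studied in this part.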



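We now prove the following lemma: 

Lemma 11 Let us define 

$$\Omega_l^i = \left\{ \omega \in \Omega \;\middle|\; \lim_{m \to \infty} \frac{1}{m} \sum_{j=1}^{m} d_l(X_j^i(\omega)) = \int d_l \, d\mu \right\}$$

and 

$$B_l^i = \left\{ \omega \in \Omega \;\middle|\; \forall g \in C(Z, \mathbb{R}), \ \lim_{m \to \infty} \sup_{w_l \in W_l} \left| M_l^i(g, w_l)(\omega)_m - M_l(g, w_l) \right| = 0 \right\}.$$

Under hypotheses $H_a$, $H_c$ and $H_e$, $B_l^i \cap \Omega_l^i$ is measurable and $P(B_l^i \cap \Omega_l^i) = 1$. 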



Proof of lemma 11 The proof is based on the separability of $C(Z, \mathbb{R})$ and 
on corollary 9. Let us first note that $P(\Omega_l^i) = 1$ thanks to hypothesis $H_c$ (2-c) 
and the strong law of large numbers. Let us now show that the set 

$$B_l^i(g) = \left\{ \omega \in \Omega \;\middle|\; \lim_{m \to \infty} \sup_{w_l \in W_l} \left| M_l^i(g, w_l)(\omega)_m - M_l(g, w_l) \right| = 0 \right\}$$

is such that $P(B_l^i(g)) = 1$ for any $g \in C(Z, \mathbb{R})$. 

This can be obtained by applying corollary 9 to the function $\psi$ from $W_l \times 
(Z \times \mathbb{R})$ to $\mathbb{R}$ defined as follows 

$$\psi(w_l, (x, e)) = F_l(w_l, x) (g(x) + e),$$

and to the sequence of random elements $(X_j^i, \varepsilon_j^i)_{j \in \mathbb{N}}$. Corollary 9 applies 
because: 

• $W_l$ is compact (hypothesis $H_c$ (1)) 

• hypothesis $H_c$ (2-b) implies that $\psi$ is measurable with respect to its second 
variable 

• hypothesis $H_c$ (2-a) and $g \in C(Z, \mathbb{R})$ imply that $\psi$ is continuous with 
respect to its first variable 

• hypothesis $H_c$ (2-c) implies that $\forall x \in Z$, $\forall w_l \in W_l$ and $\forall e \in \mathbb{R}$, 
$|\psi(w_l, (x, e))| \le d_l(x) (|g(x)| + |e|)$ 

• as $g$ is continuous on the compact set $Z$, $g \in L^q(\mu)$ and therefore 
$d_l \, |g| \in L^1(\mu)$ (according to hypothesis $H_c$ (2-c)) 

• as $E(|\varepsilon_1^i|^q) < \infty$ (hypothesis $H_e$ (5)), $E(d_l(X_1^i) \, |\varepsilon_1^i|) < \infty$ 

• $(X_j^i, \varepsilon_j^i)_{j \in \mathbb{N}}$ is i.i.d. (hypothesis $H_e$) 



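Therefore, we have: 

$$\sup_{w_l \in W_l} \left| M_l^i(g, w_l)_m - E\left( F_l(w_l, X_1^i) (g(X_1^i) + \varepsilon_1^i) \right) \right| \xrightarrow[m \to \infty]{} 0 \quad P \text{ a.s.}$$

By definition, $E(F_l(w_l, X_1^i) \, g(X_1^i)) = M_l(g, w_l)$ and by independence and 
hypothesis $H_e$ (5): 

$$E(F_l(w_l, X_1^i) \, \varepsilon_1^i) = E(F_l(w_l, X_1^i)) \, E(\varepsilon_1^i) = 0.$$

Therefore: 

$$\sup_{w_l \in W_l} \left| M_l^i(g, w_l)_m - M_l(g, w_l) \right| \xrightarrow[m \to \infty]{} 0 \quad P \text{ a.s.},$$

which means that $P(B_l^i(g)) = 1$. 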



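As $C(Z, \mathbb{R})$ is separable, there is a sequence $(h_t)_{t \in \mathbb{N}}$ dense in $C(Z, \mathbb{R})$ (for the 
uniform norm). Let us denote $A_l^i = \Omega_l^i \cap \bigcap_{t \in \mathbb{N}} B_l^i(h_t)$. $A_l^i$ is measurable and 
$P(A_l^i) = 1$. Obviously, $B_l^i \cap \Omega_l^i \subset A_l^i$. Let us now show that $B_l^i \cap \Omega_l^i = A_l^i$. 

Let $\omega \in A_l^i$. As $\omega \in \Omega_l^i$, $\frac{1}{m} \sum_{j=1}^{m} d_l(X_j^i(\omega))$ is a convergent sequence and is 
therefore bounded, so there is $\gamma_l(\omega) > 1$ such that for all $m$, $\frac{1}{m} \sum_{j=1}^{m} d_l(X_j^i(\omega)) < 
\gamma_l(\omega)$. Moreover, we can choose $\gamma_l(\omega)$ such that $\gamma_l(\omega) > E(d_l(X_1^i))$. 

Let $g \in C(Z, \mathbb{R})$. For any $\epsilon > 0$, there is $t \in \mathbb{N}$ such that $\rho_Z(g, h_t) < \frac{\epsilon}{3 \gamma_l(\omega)}$. 

This implies for all $w_l \in W_l$ and for all $m$ both $|M_l^i(w_l, g)(\omega)_m - 
M_l^i(w_l, h_t)(\omega)_m| < \frac{\epsilon}{3}$ and $|M_l(w_l, g) - M_l(w_l, h_t)| < \frac{\epsilon}{3}$. As $\omega \in A_l^i$, 
$M_l^i(w_l, h_t)(\omega)_m$ converges to $M_l(w_l, h_t)$ uniformly on $W_l$. Therefore there is 
$M$ such that $m > M$ implies $\sup_{w_l \in W_l} |M_l^i(w_l, h_t)(\omega)_m - M_l(w_l, h_t)| < \frac{\epsilon}{3}$. Then 
$m > M$ implies $\sup_{w_l \in W_l} |M_l^i(w_l, g)(\omega)_m - M_l(w_l, g)| < \epsilon$. As this is true for any 
$\epsilon$, we conclude that $M_l^i(w_l, g)(\omega)_m$ converges uniformly on $W_l$ to $M_l(w_l, g)$, and 
therefore that $\omega \in B_l^i(g) \cap \Omega_l^i$. As this is true for all $g$, $\omega \in B_l^i \cap \Omega_l^i$. Therefore, 
$B_l^i \cap \Omega_l^i = A_l^i$, which gives the conclusion of the lemma. 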
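The three-term argument at the end of this proof can be written out explicitly. The display below is a sketch of that step, using the notation of the proof ($h_t$ the dense sequence, $\gamma_l(\omega)$ the bound on the empirical means of $d_l$, $\rho_Z$ the uniform distance on $C(Z, \mathbb{R})$) and assuming, as in the definition of $M_l^i$, that the same noise terms appear in $M_l^i(w_l, g)(\omega)_m$ and $M_l^i(w_l, h_t)(\omega)_m$ so that they cancel in the difference.

```latex
\begin{align*}
\bigl| M_l^i(w_l, g)(\omega)_m - M_l^i(w_l, h_t)(\omega)_m \bigr|
  &= \Bigl| \frac{1}{m} \sum_{j=1}^{m} F_l\bigl(w_l, X_j^i(\omega)\bigr)
       \bigl( g(X_j^i(\omega)) - h_t(X_j^i(\omega)) \bigr) \Bigr| \\
  &\le \rho_Z(g, h_t) \, \frac{1}{m} \sum_{j=1}^{m} d_l\bigl(X_j^i(\omega)\bigr)
   < \frac{\epsilon}{3 \gamma_l(\omega)} \, \gamma_l(\omega) = \frac{\epsilon}{3}, \\[4pt]
\bigl| M_l^i(w_l, g)(\omega)_m - M_l(w_l, g) \bigr|
  &\le \bigl| M_l^i(w_l, g)(\omega)_m - M_l^i(w_l, h_t)(\omega)_m \bigr| \\
  &\quad + \bigl| M_l^i(w_l, h_t)(\omega)_m - M_l(w_l, h_t) \bigr| \\
  &\quad + \bigl| M_l(w_l, h_t) - M_l(w_l, g) \bigr|
   < \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon
   \qquad (m > M).
\end{align*}
```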



part 3 

Let us now apply corollary 9 to $\lambda_n(w)$, more precisely to the function from 
$W \times (C(Z, \mathbb{R}) \times \mathbb{R}^o)$ to $\mathbb{R}$ defined by: 

$$k(w, g, t) = c\left( t, U\left( w_0, \int F_1(w_1, x) \, g(x) \, d\mu(x), \ldots, \int F_k(w_k, x) \, g(x) \, d\mu(x) \right) \right).$$

This is possible for the following reasons: 

• $W$ is compact 

• $k$ is continuous in $w$ for each $(g, t)$, according to hypotheses $H_c$ and 
because, as a continuous function defined on a compact set, $g$ belongs to 
$L^q(\mu)$. Indeed, $w_l \mapsto \int F_l(w_l, x) \, g(x) \, d\mu(x)$ is continuous for each $g$: as $F_l$ 
is continuous in $w_l$ for each $x$, the function $F_l(w_l', .) \, g(.)$ converges pointwise 
to $F_l(w_l, .) \, g(.)$ when $w_l'$ converges to $w_l$. Moreover, $|F_l(w_l, .) \, g(.)|$ is 
dominated on $W_l$ by $d_l(.) |g(.)|$, which is integrable (by hypothesis). Thanks 
to the dominated convergence theorem, this implies the continuity of 
$w_l \mapsto \int F_l(w_l, x) \, g(x) \, d\mu(x)$. 

• $k$ is measurable with respect to $(g, t)$ for each $w$. This is a direct consequence of 
hypotheses $H_c$ and of the fact that $g \mapsto \int F_l(w_l, x) \, g(x) \, d\mu(x)$ is continuous 
for each $w_l$ 

• hypothesis $H_d$ implies that $k(w, g, t) \le c_{\max}(t)$ for all $w$, $g$ and $t$, with 
$E(c_{\max}(T^1)) < \infty$ 

• $(G^i, T^i)_{i \in \mathbb{N}}$ is i.i.d. 

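According to corollary 9, we therefore have 

$$\sup_{w \in W} \left| \frac{1}{n} \sum_{i=1}^{n} k(w, G^i, T^i) - E(k(w, G^1, T^1)) \right| \xrightarrow[n \to \infty]{} 0 \quad P \text{ a.s.},$$

that is 

$$\sup_{w \in W} \left| \lambda_n(w) - \lambda(w) \right| \xrightarrow[n \to \infty]{} 0 \quad P \text{ a.s.} \qquad (23)$$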



Let us call $C$ the set of probability 1 on which this uniform convergence 
occurs. Let us now consider $D = C \cap \bigcap_{i \in \mathbb{N}} \bigcap_{1 \le l \le k} (B_l^i \cap \Omega_l^i)$. According to lemma 
11, $P(D) = 1$. Let $\omega$ be an arbitrary element of $D$ and denote for simplicity 
$g^i = G^i(\omega)$ and $t^i = T^i(\omega)$. Let $\epsilon > 0$ be an arbitrary real number. According 
to equation 23, there is $N$ such that for each $n > N$, 

$$\sup_{w \in W} \left| \frac{1}{n} \sum_{i=1}^{n} k(w, g^i, t^i) - \lambda(w) \right| < \frac{\epsilon}{2}. \qquad (24)$$



We handle here the case where $c$ is not a distance on $\mathbb{R}^o$ but simply a 
continuous positive function. As $U$ is bounded and uniformly continuous, the function 
$l(t, w_0, u) = c(t, U(w_0, u))$ from $\mathbb{R}^o \times W_0 \times \mathbb{R}^k$ is uniformly continuous with 
respect to $(w_0, u)$. That is, for each $t^i$, there is $\eta_i > 0$ such that for each $w_0$ 
and $(u, u') \in \mathbb{R}^k \times \mathbb{R}^k$, $\|u - u'\| < \eta_i \Rightarrow |l(t^i, w_0, u) - l(t^i, w_0, u')| < \frac{\epsilon}{2}$. As 
$\omega \in \bigcap_{i \in \mathbb{N}} \bigcap_{1 \le l \le k} B_l^i$, for each $i$ there is $S^i$ such that $m^i > S^i$ implies for all 
$l$ 

$$\sup_{w_l \in W_l} \left| M_l^i(w_l, g^i)(\omega)_{m^i} - M_l(w_l, g^i) \right| < \eta_i.$$

Let us call $S_n = \sup_{i \le n} S^i$. Then for $m > S_n$, for all $w$ and for all $i \le n$ 



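$$\left| c\left( t^i, U\left( w_0, M_1^i(w_1, g^i)(\omega)_m, \ldots, M_k^i(w_k, g^i)(\omega)_m \right) \right) - c\left( t^i, U\left( w_0, M_1(w_1, g^i), \ldots, M_k(w_k, g^i) \right) \right) \right| < \frac{\epsilon}{2},$$

that is for all $w \in W$ 

$$\left| \frac{1}{n} \sum_{i=1}^{n} c\left( t^i, U\left( w_0, M_1^i(w_1, g^i)(\omega)_m, \ldots, M_k^i(w_k, g^i)(\omega)_m \right) \right) - \frac{1}{n} \sum_{i=1}^{n} c\left( t^i, U\left( w_0, M_1(w_1, g^i), \ldots, M_k(w_k, g^i) \right) \right) \right| < \frac{\epsilon}{2}.$$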



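Combined with equation 24, this gives that for $n > N$ and $m > M(n)$: 

$$\sup_{w \in W} \left| \lambda_n^m(w) - \lambda(w) \right| < \epsilon.$$

Therefore for almost all $\omega$ (i.e., for $\omega \in D$), we have: 

$$\lim_{n \to \infty} \lim_{m \to \infty} \sup_{w \in W} \left| \lambda_n^m(w) - \lambda(w) \right| = 0. \qquad (25)$$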
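The way the two bounds combine can be spelled out as follows. This is a sketch that assumes $\lambda_n^m(w)$ is the empirical average of the costs computed from the approximated integrals $M_l^i(w_l, g^i)(\omega)_m$, and that $\frac{1}{n} \sum_{i=1}^{n} k(w, g^i, t^i)$ is the same average computed from the exact integrals $M_l(w_l, g^i)$; the formal definition of $\lambda_n^m$ is given with the statement of the theorem.

```latex
\begin{align*}
\bigl| \lambda_n^m(w) - \lambda(w) \bigr|
  &\le \Bigl| \lambda_n^m(w) - \frac{1}{n} \sum_{i=1}^{n} k(w, g^i, t^i) \Bigr|
     + \Bigl| \frac{1}{n} \sum_{i=1}^{n} k(w, g^i, t^i) - \lambda(w) \Bigr| \\
  &< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
  \qquad \text{for all } w \in W .
\end{align*}
```

The first term is controlled by the uniform continuity argument (integral approximation error), the second by equation 24 (sampling error over individuals).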



part 4 



The final conclusion of the theorem is obtained exactly as in White (1989). 
We use the following lemma: 

Lemma 12 Let $W$ be a compact set (considered with the metric $d$) 
and $(f_n^m)_{n \in \mathbb{N}, m \in \mathbb{N}}$ a sequence of sequences of real valued continuous 
functions that converges uniformly to a continuous function $f$, that is 
$\lim_{n \to \infty} \lim_{m \to \infty} \rho_W(f_n^m, f) = 0$. Let us call $W^*$ the set of minimizers of $f$ 
and let $w_n^m$ be a minimizer of $f_n^m$. Then 

$$\lim_{n \to \infty} \lim_{m \to \infty} d(w_n^m, W^*) = 0.$$



Proof of lemma 12 First of all, it is clear that we just have to prove that 
the set of accumulation points of $(w_n^m)_{n \in \mathbb{N}, m \in \mathbb{N}}$, Acc, is included in $W^*$. Indeed, 
assume that both $\mathrm{Acc} \subset W^*$ and that the conclusion of the lemma does not 
hold. This implies that there is an infinite subsequence of $(w_n^m)_{n \in \mathbb{N}, m \in \mathbb{N}}$ whose 
distance to $W^*$ remains above a fixed positive number. As $W$ is compact, this 
subsequence has at least one accumulation point, which cannot belong to $W^*$. 
As this accumulation point is also an accumulation point of the full sequence, 
this contradicts our main hypothesis. 

Let us now consider $w^0$, an accumulation point of the sequence. Strictly 
speaking, $w^0$ is the limit of a subsequence of the main sequence, but to simplify the 
proof, we assume that $w^0 = \lim_{n \to \infty} \lim_{m \to \infty} w_n^m$. 

Let $\epsilon > 0$. $f$ is uniformly continuous on $W$ and therefore there is $\eta$ such that 
$|w' - w| < \eta$ implies $|f(w) - f(w')| < \epsilon$. By uniform convergence, there is 
$N$ such that for each $n > N$, there is $M_n$ such that $m > M_n$ implies for all 
$w \in W$, $|f_n^m(w) - f(w)| < \epsilon$. Moreover, we can choose $N$ and $M_n$ such that 
$n > N$ and $m > M_n$ imply $|w_n^m - w^0| < \eta$. Therefore, $n > N$ and $m > M_n$ 
imply $|f_n^m(w_n^m) - f(w^0)| < 2\epsilon$. 

As $w_n^m$ is a minimizer of $f_n^m$, for all $w$, $f_n^m(w_n^m) - f_n^m(w) \le 0$, which implies 
(by uniform convergence) $f_n^m(w_n^m) - f(w) \le \epsilon$. Therefore, $f(w^0) - f(w) < 3\epsilon$. 
As this is true for all $\epsilon$, we conclude that $f(w^0) - f(w) \le 0$ for all $w$ and 
therefore that $w^0 \in W^*$. Therefore $\mathrm{Acc} \subset W^*$. 
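The statement of lemma 12 can be illustrated numerically. The sketch below is an assumption-laden toy example and not the paper's setting: $W = [-2, 2]$ (replaced by a fine grid), a limit function $f(w) = (w^2 - 1)^2$ whose set of minimizers is $W^* = \{-1, 1\}$, and hypothetical perturbed functions `f_nm` whose uniform distance to $f$ is at most $1/n + 1/m$.

```python
import numpy as np

w_grid = np.linspace(-2.0, 2.0, 40_001)      # fine grid standing in for the compact set W
minimizers = np.array([-1.0, 1.0])           # W*: minimizers of the limit function f

def f(w):
    return (w**2 - 1.0) ** 2

def f_nm(w, n, m):
    # Hypothetical perturbation of f with sup-norm distance to f at most 1/n + 1/m
    return f(w) + np.sin(7.0 * w) / n + np.cos(3.0 * w) / m

def distance_to_minimizers(n, m):
    """Distance from a grid minimizer of f_nm to the set W*."""
    w_nm = w_grid[np.argmin(f_nm(w_grid, n, m))]
    return float(np.min(np.abs(w_nm - minimizers)))

distances = {k: distance_to_minimizers(k, k) for k in (1, 10, 100, 10_000)}
for k, dist in distances.items():
    print(k, dist)
```

Note that the minimizer of `f_nm` may jump between neighbourhoods of $-1$ and $1$; the lemma only claims convergence of the distance to the set $W^*$, not convergence to a single point.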

The conclusion of the theorem is obtained by applying lemma 12 to all $\omega \in 
D$. For such an $\omega$, the uniform convergence of $\lambda_n^m$ to $\lambda$ translates into the 
convergence of any minimizer of $\lambda_n^m$ to the set of minimizers of $\lambda$. 